1  Introduction


The Speech Communication Group at the Research Laboratory of Electronics,
Massachusetts Institute of Technology, submits this final report for contract
N00039-85-C-0341, awarded by the Information Science Technology Office of the
Defense Advanced Research Projects Agency and monitored by the Space and Naval
Warfare Systems Command. The contract covers the 24-month period starting on
June 27, 1985, and was awarded for the development of an acoustic-phonetic
database to be used by the research community of the Strategic Computing Speech
Program.


The development of this database is thought to be crucial to the speech program
because the acoustic realization of phonemes depends on complex interactions
among a multitude of factors. Therefore, in order to develop a
speaker-independent, phonetically based speech recognition system, a large body
of speech data, collected from many speakers, is needed to help us discover and
quantify these context-dependent phenomena. In addition, the speech database can
serve two other functions. First, it can be used for training certain speech
recognition systems. For some algorithms, such as hidden Markov modelling (HMM),
a large amount of training data is needed to obtain stable estimates of the
parameters of the stochastic models. For rule-based algorithms, substantial
amounts of data are also needed in order to set proper thresholds on speech
parameters. Second, the database can be used for performance evaluation. Given
the many different approaches to the speech recognition problem, it is often
difficult to compare their relative merits. Testing specific recognition
algorithms or entire speech recognition systems on a common database will
provide a means to evaluate their relative performance.

The specific responsibilities of MIT in developing the database were:

•  To take primary responsibility for the design of an acoustic-phonetic corpus,
and to provide a detailed analysis of that corpus,

•  To coordinate with researchers at Texas Instruments and elsewhere in the
specification of the recording procedures for the database,

•  To  develop  a  semiautomatic  system  to  align  the  transcriptions  with  the  speech 
waveform,  with  associated  capabilities  for  researchers  to  modify  and  correct  the 
resulting  alignments, 

•  To  provide  time-aligned  orthographic  and  phonetic  transcriptions  for  the  recorded 
sentences,

•  To develop a database management program, so that researchers can easily
access  parts  of  the  database  by  specifying  constraints  on  the  phonetic  or  lexical 
environment  of  interest,  and/or  the  speaker’s  dialect  and  sex,  and 

•  To  transfer  the  speech  database  to  NBS,  and  to  coordinate  with  researchers  at 
NBS  in  specifying  the  procedure  for  distributing  it. 

In the next section, we describe in detail the various tasks associated with
MIT’s database development effort. Because the database is considerably larger
than we originally proposed, its development was not completed until June 1988.


2  Task  Description 

2.1  Corpus  Design 

The database design is the result of a joint effort among MIT, SRI, and TI. The
corpus comprises 2342 distinct sentences drawn from three different sets:

1.  Two (2) speaker calibration sentences, provided by SRI, designed to
incorporate phonemes in contexts where significant dialectal differences are
anticipated. These two sentences were spoken by all talkers.

2.  Four  hundred  and  fifty  (450)  phonetically  compact  sentences,  hand-designed  by 
MIT  with  emphasis  on  as  complete  a  coverage  of  phonetic  pairs  as  is  practical. 
Each  sentence  was  spoken  by  seven  talkers,  in  order  to  provide  a  feeling  for 
speaker  variation. 

3.  One  thousand,  eight  hundred  and  ninety  (1890)  randomly  selected  sentences, 
chosen  by  TI,  providing  alternate  contexts  and  multiple  occurrences  of  the  same 
phonetic  sequence  in  different  word  sequences.  These  were  chosen  primarily 
from  the  “Brown  corpus”  of  American  English  sentences  [1],  along  with  a  few 
sentences  from  the  Hultzen  et  al.  corpus  of  playwrights’  dialogue. 

This combination of sentences was selected to balance the conflicting desires
for compact phonetic coverage, contextual diversity, and speaker variability.
The research community decided that these three criteria were paramount for the
initial acoustic-phonetic database. Each speaker read the 2 calibration
sentences, 5 of the phonetically compact sentences, and 3 of the randomly
selected sentences, for a total of 10 sentences.
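
The sentence counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only; it uses the figures given in this report (including the 630 speakers reported in Section 2.3), and the variable names are our own.

```python
# Corpus bookkeeping using only figures quoted in the report.
CALIBRATION = 2        # SRI sentences, read by every talker
COMPACT = 450          # MIT sentences, each read by 7 talkers
RANDOM = 1890          # TI sentences, each read by 1 talker
SPEAKERS = 630         # reported in Section 2.3
PER_SPEAKER = {"calibration": 2, "compact": 5, "random": 3}

distinct_sentences = CALIBRATION + COMPACT + RANDOM
utterances = SPEAKERS * sum(PER_SPEAKER.values())

# Readings required by each set must equal the readings the speakers supply.
compact_balanced = COMPACT * 7 == SPEAKERS * PER_SPEAKER["compact"]
random_balanced = RANDOM * 1 == SPEAKERS * PER_SPEAKER["random"]

print(distinct_sentences)   # 2342
print(utterances)           # 6300
print(compact_balanced, random_balanced)  # True True
```

The two balance checks confirm that seven readings of each compact sentence and one reading of each random sentence are exactly what 630 speakers produce at 5 and 3 sentences apiece.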


The set of 450 compact and comprehensive sentences was developed at MIT using
an iterative method [2]. Using ALexis [3] and the Merriam-Webster Pocket
Dictionary (Pocket), we interactively created sentences and analyzed the
resulting corpus. We began with a corpus created for the MIT speech spectrogram
reading course, which included basic phonetic coverage and varying phonetic
environments. Examining pairs of phonemes, we augmented these sentences,
attempting to have at least one occurrence of each phoneme doublet. ALexis was
used to search the Pocket dictionary for words having phoneme sequences that
were not yet represented and for words beginning or ending with a specific
phoneme. We then created sentences using the new words and added them to the
corpus. Certain difficult sequences were emphasized, such as vowel-vowel and
stop-stop sequences. For a more detailed description, the reader is referred to
Appendix A.
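
The iterative search for missing phoneme doublets can be sketched as follows. This is a minimal illustration and not the ALexis tool itself: the function names, the toy inventory, and the toy lexicon are all assumptions, and a real run would restrict the target pairs to those that are phonotactically possible.

```python
from itertools import product

def covered_pairs(transcriptions):
    """Return the set of adjacent phoneme pairs seen in the sentences."""
    pairs = set()
    for phones in transcriptions:
        pairs.update(zip(phones, phones[1:]))
    return pairs

def missing_pairs(transcriptions, inventory):
    """Pairs over the inventory that no sentence contains yet."""
    return set(product(inventory, repeat=2)) - covered_pairs(transcriptions)

def words_supplying(pair, lexicon):
    """Dictionary words whose pronunciation contains the wanted pair."""
    return sorted(word for word, phones in lexicon.items()
                  if pair in zip(phones, phones[1:]))

# Toy data (hypothetical): one sentence, a three-phoneme inventory.
corpus = [["s", "t", "a"]]
inventory = ["s", "t", "a"]
lexicon = {"at": ["a", "t"], "sat": ["s", "a", "t"], "ta": ["t", "a"]}

gaps = missing_pairs(corpus, inventory)
print(len(gaps))                             # 7
print(words_supplying(("a", "t"), lexicon))  # ['at', 'sat']
```

Each pass of the loop described above would pick a pair from `gaps`, look up candidate words, write a sentence around one of them, and recompute the coverage.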


2.2  Analysis  of  Phonetic  Coverage 

This section describes the phonetic coverage of the compact sentence set
developed at MIT and of the resulting corpus of combined MIT and TI sentences
(hereafter referred to as the acoustic-phonetic database, or APDB). This
analysis does not include the calibration sentences, as we consider their use
to be of a different nature.

Table 1 compares some of the distributional properties of four databases: the
Merriam-Webster Pocket Dictionary (Pocket), the Harvard Lists of phonetically
balanced sentences (HL), the MIT-selected sentences (MIT-450), and the APDB
sentences. The APDB includes seven copies of each MIT-450 sentence, to account
for the number of talkers per sentence, and a single copy of each randomly
selected sentence (TI-1890).

As the table reveals, the proportion of unique words relative to the total
number of words is substantially larger in the MIT-450 than in the APDB,
probably because of the selection procedure: whenever possible, new words were
used in sentences to avoid duplication. Roughly 50% of the MIT-450 words are
unique, as compared to only 15% of the APDB words. The TI-1890 sentences are,
on average, slightly longer than those in the MIT-450. The 10 most frequently
occurring words in all of the corpora are function words or pronouns. In both
the MIT-450 and the APDB corpora, the most common word is “the,” accounting for
roughly 7% of all words.
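
The percentages and sentence lengths quoted here follow directly from the word and sentence counts in Table 1. The short calculation below is only a worked check; the helper name is our own.

```python
def ratio(numerator, denominator):
    """Plain division, named for readability."""
    return numerator / denominator

# Counts taken from Table 1.
mit_unique, mit_words, mit_sents = 1792, 3403, 450
apdb_unique, apdb_words, apdb_sents = 6103, 41161, 5040

mit_unique_share = ratio(mit_unique, mit_words)
apdb_unique_share = ratio(apdb_unique, apdb_words)

print(f"{mit_unique_share:.1%}")             # 52.7%
print(f"{apdb_unique_share:.1%}")            # 14.8%
print(f"{ratio(mit_words, mit_sents):.1f}")  # 7.6 words per sentence
print(f"{ratio(apdb_words, apdb_sents):.1f}")  # 8.2 words per sentence

# The APDB sentence count is seven copies of MIT-450 plus TI-1890.
print(7 * 450 + 1890 == apdb_sents)          # True
```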

Table 1: Description of Databases

                           POCKET           HL    MIT-450       APDB
no. sentences                   -          720        450       5040
no. unique words           19,837         1894       1792       6103
no. words                  19,837         5745       3403     41,161
ave no. words/sent              -  [illegible]        7.6        8.2
min no. words/sent              -            5          4          2
max no. words/sent              -           12         13         19
ave no. syls/word    [illegible]*          1.1       1.58       1.54
ave no. phones/word         3.34*         2.97        4.0       3.89

* The ave. no. syls/word and ave. no. phones/word have been weighted by Brown
Corpus [1] word frequencies.

Not all the words in the APDB occur in Pocket. For these cases, we generated
phonemic transcriptions by rule-based expansion of the dictionary entries, or,
as a last resort, by a text-to-speech synthesizer. We expect that there are
pronunciation variations between the dictionary and the text-to-speech
synthesizer, particularly with respect to vowel color. There may also be some
pronunciation errors, but we think these will be statistically insignificant.

Table  2  shows  the  distribution  of  within-word  consonant  sequences  for  the  four 
databases.  The  APDB  has  more  complete  coverage  of  consonant  sequences  than  the 
MIT-450,  particularly  for  the  word-final  and  word-medial  sequences.  We  examined  a 
list  of  all  of  the  word-initial  and  word-final  clusters  in  the  sentence  list,  and  compared 
these  with  the  occurrences  in  Pocket.  We  verified  that  essentially  every  initial  cluster 
that  occurred  more  than  once  in  the  Pocket  lexicon  was  included  at  least  once  in  the 
APDB,  and  that  most  of  the  final  clusters  were  covered.  Often,  if  a  word-final  cluster 
did  not  occur  in  word-final  position  in  the  APDB,  the  sequence  did  occur  within  a 
word  or  across  a  word  boundary.  Generally,  the  sequences  occurring  in  Pocket  that 
are  not  covered  by  APDB  are  from  borrowed  words  such  as  “moire”  and  “svelte.” 
The  APDB  also  contains  many  word-final  consonant  sequences  that  were  not  present 
in  MIT-450.  Since  the  Pocket  lexicon  does  not  include  suffixes,  there  are  more  word- 
final  consonant  sequences  in  the  APDB.  The  reader  is  referred  to  Appendix  A  for 
tables summarizing further analyses of the properties of the APDB.


Table 2: Distribution of Consonant Sequences

                        POCKET       HL    MIT-450     APDB
no. unique words        19,837     1894       1792     6103
no. word-initial            75       59         64       68
no. word-final             129      105        102      146
no. word-medial            608      123        228      388

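
The counts in Table 2 come from tallying distinct within-word consonant sequences by position. A minimal sketch of such a tally is given below. It is not the original analysis code: the phone symbols, the vowel set, and the choice to count only runs of two or more consonants (`min_len=2`) are assumptions made for illustration.

```python
def consonant_sequences(phones, vowels, min_len=2):
    """Return (position, run) for each consonant run in one word."""
    found = []
    run_start = None
    # The trailing None acts as a sentinel that flushes a word-final run.
    for i, phone in enumerate(list(phones) + [None]):
        is_consonant = phone is not None and phone not in vowels
        if is_consonant and run_start is None:
            run_start = i
        elif not is_consonant and run_start is not None:
            run = tuple(phones[run_start:i])
            if len(run) >= min_len:
                if run_start == 0:
                    position = "initial"
                elif i == len(phones):
                    position = "final"
                else:
                    position = "medial"
                found.append((position, run))
            run_start = None
    return found

def tally(lexicon, vowels, min_len=2):
    """Count distinct consonant sequences per word position."""
    seen = {"initial": set(), "medial": set(), "final": set()}
    for phones in lexicon.values():
        for position, run in consonant_sequences(phones, vowels, min_len):
            seen[position].add(run)
    return {position: len(runs) for position, runs in seen.items()}

# Toy lexicon with made-up phone symbols.
vowels = {"e", "a"}
lexicon = {
    "strengths": ["s", "t", "r", "e", "ng", "k", "th", "s"],
    "extra": ["e", "k", "s", "t", "r", "a"],
}
print(tally(lexicon, vowels))  # {'initial': 1, 'medial': 1, 'final': 1}
```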

2.3  Data  Collection 

The recording of the data was primarily TI’s responsibility, although we
provided limited help in specifying such things as the recording environment,
the types of microphone, and the sampling rate. Speech data was collected and
recorded using Vocabulary Master Library (VML) files. A total of 630 VML files
were created and run on the STEROIDS system (a VAX Fortran automated speech
data collection system, also known as the STEReO automatic Interactive Data
collection System), developed by TI [4].

The speech data was digitally recorded at 20 kHz in a relatively quiet
environment, simultaneously on a pressure-sensitive microphone and on a
Sennheiser close-talking microphone. Digital tapes were shipped to NBS, where
they were filtered and downsampled to 16 kHz. The speech data was then sent to
MIT to generate the orthographic and phonetic transcriptions. (The
transcription procedures are described in the next section and in Appendices B
and C.)

Each of 630 speakers, from 8 dialect regions of the United States, read 10
sentences. Table 3 provides a summary of the number of speakers from each of
the dialect regions. Approximately 70% of the speakers (439) are male and 30%
are female.


Table 3: Distribution of Speakers

no.    location          # speakers
 1     New England       [illegible]
 2     Northern          [illegible]
 3     North Midland         102
 4     South Midland         100
 5     Southern               99
 6     New York City          46
 7     Western               107
 8     Army Brat*             33
       Total:                630

* The term “Army Brat” is used to denote speakers who lived in several
geographical areas during their early lives.


2.4  Automatic Transcription Alignment System Development

The large amount of acoustic data constitutes only one part of the speech
database. The utterances must also be augmented with a set of time-aligned
transcriptions that enable the user to have direct access to specific portions
of the speech signal. Thus, for example, a researcher is able to query the
database for all occurrences of the phoneme /t/ preceding a stressed vowel. In
addition, the researcher has the ability to pinpoint the locations of the
consonantal closures and releases and to make measurements based on the
time-aligned transcription.
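
To make the kind of query mentioned above concrete, the sketch below searches a time-aligned transcription for a target phone that immediately precedes a stressed vowel. The `Segment` record, its `stressed` flag, and the phone symbols are hypothetical; they do not describe the actual database file format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float            # seconds
    end: float              # seconds
    label: str
    stressed: bool = False  # meaningful for vowels only (assumed field)

def find_before_stressed_vowel(segments, target, vowels):
    """Return every `target` segment directly followed by a stressed vowel."""
    hits = []
    for current, following in zip(segments, segments[1:]):
        if (current.label == target
                and following.label in vowels
                and following.stressed):
            hits.append(current)
    return hits

# Toy utterance with made-up labels and times.
utterance = [
    Segment(0.00, 0.05, "s"),
    Segment(0.05, 0.09, "t"),
    Segment(0.09, 0.20, "aa", stressed=True),
    Segment(0.20, 0.25, "t"),
    Segment(0.25, 0.30, "ax"),
]
hits = find_before_stressed_vowel(utterance, "t", {"aa", "ax"})
print([(h.start, h.end) for h in hits])  # [(0.05, 0.09)]
```

Because each hit carries its start and end times, the matching portion of the waveform can be located directly, which is the point of keeping the transcription time-aligned.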

Traditionally, the alignment of a phonetic transcription with the corresponding
speech waveform is done manually by a trained acoustic-phonetician. This is an
extremely time-consuming procedure, requiring the expertise of one of a very
small number of people. Therefore, the amount of data that can be labeled is
limited. Manual labeling also often involves highly subjective decisions, so
that the results lack consistency and reproducibility and can vary
substantially from one person to another. In the past few years, several
automatic time-alignment procedures have been suggested [5, 6, 7, 8, 9]. The
general approach is to align the acoustic waveform with a “reference” waveform,
using dynamic programming algorithms. The reference is either a waveform
generated from a phonetic transcription by synthesis techniques or a previously
labeled utterance having the same transcription. However, this approach
requires either that the synthesis technique be of high quality or that the two
utterances have identical phonetic transcriptions (which is rare across
speakers).
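
For readers unfamiliar with the dynamic programming approach, the following is a textbook dynamic time warping sketch. It is not the specific procedure of any of the cited systems: it aligns two one-dimensional feature sequences with an absolute-difference cost, both of which are simplifying assumptions.

```python
def dtw_path(reference, test, dist=lambda a, b: abs(a - b)):
    """Return (total cost, warping path) aligning `test` to `reference`."""
    n, m = len(reference), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(reference[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # both advance
                                 cost[i - 1][j],      # reference advances
                                 cost[i][j - 1])      # test advances
    # Trace back from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path

total, path = dtw_path([0, 1, 2], [0, 0, 1, 2])
print(total)  # 0.0
print(path)   # [(0, 0), (0, 1), (1, 2), (2, 3)]
```

Once such a path is available, a boundary marked at a given frame of the labeled reference can be carried over to the matched frame of the new utterance. This is also where the limitation noted above arises: the path is only meaningful if the two sequences contain the same phonetic events.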

Transcription alignment of the TIMIT database uses CASPAR, an automatic
alignment system developed at MIT. Descriptions of preliminary implementations
of CASPAR can be found elsewhere [10, 11]. (One of these descriptions is
attached to this report as Appendix B.) Basically, phonetic alignment is
accomplished in three steps. First, each 5 ms frame of the speech data is
assigned one of five broad-class labels: sonorant, obstruent, voiced-consonant,
nasal/voicebar, and silence, using a non-parametric pattern classifier. The
assignment process uses a binary decision tree, based on a set of acoustically
motivated features. Each sequence of identically labelled frames is then
collapsed into a segment with the same label, thus establishing a broad-class
segmentation of the speech. Next, the output of the initial classifier is
aligned with the phonetic transcription using a search strategy with some
look-ahead capability, guided by a few acoustic-phonetic rules. The resulting
alignment provides “islands of reliability” for more detailed segmentation and
refinement of boundaries. Finally, acoustic regions that still correspond to
two or more phonetic events after the preliminary alignment are segmented
further using specific algorithms based on knowledge of the phonetic context.
In some instances, heuristic rules are invoked to assign consistent but
somewhat arbitrary boundaries (as between a vowel and a semivowel).
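
The collapsing of identically labelled frames into broad-class segments in the first step amounts to run-length encoding. The sketch below illustrates only that operation; the classifier that produces the frame labels is not shown, and the function name is our own.

```python
from itertools import groupby

def collapse_frames(frame_labels, frame_ms=5):
    """Merge runs of identical frame labels into (start_ms, end_ms, label)."""
    segments = []
    start = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((start * frame_ms, (start + length) * frame_ms, label))
        start += length
    return segments

frames = ["silence"] * 3 + ["obstruent"] * 2 + ["sonorant"] * 4
print(collapse_frames(frames))
# [(0, 15, 'silence'), (15, 25, 'obstruent'), (25, 45, 'sonorant')]
```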

A formal evaluation showed that CASPAR can correctly perform over 95% of the
labeling task previously done by human transcribers. The boundary locations
produced by the system agree well with those produced by human transcribers.
For example, over 75% of the automatically generated boundaries were within 10
ms of a boundary entered by a trained phonetician.
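
An agreement figure of this kind can be computed as the fraction of automatic boundaries that fall within a tolerance of a hand-entered boundary. The sketch below assumes each automatic boundary is compared with the nearest reference boundary; the report does not state how boundaries were paired in the formal evaluation, and the example times are invented.

```python
def boundary_agreement(automatic, reference, tolerance_ms=10.0):
    """Fraction of automatic boundaries within tolerance of a reference one."""
    if not automatic or not reference:
        raise ValueError("both boundary lists must be non-empty")
    within = sum(
        1 for a in automatic
        if min(abs(a - r) for r in reference) <= tolerance_ms
    )
    return within / len(automatic)

auto_ms = [100.0, 205.0, 330.0, 480.0]   # made-up times in milliseconds
hand_ms = [102.0, 200.0, 350.0, 478.0]
print(boundary_agreement(auto_ms, hand_ms))  # 0.75
```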

Whenever the automatic alignment system makes a mistake in boundary location,
or when it fails to find a single possible alignment, human intervention is
necessary. In any case, the output of the alignment system must be certified by
an experienced acoustic-phonetician. Therefore, a set of rules was specified so
that boundary locations were placed as consistently as possible from
transcriber to transcriber. These rules are described in Section 2.5.2. For a
more in-depth discussion of the boundary criteria, the reader is referred to
Appendix C.

Since the early implementation of CASPAR, as described in the literature, two
major modifications have been made. First, the second module of the system,
which aligns the acoustic labels with the phonetic symbols, has been cast into
a probabilistic framework. By using a large amount of speech data for training,
a set of context-dependent and durational statistics was obtained. As a result,
the system has been found to be more robust. Second, a new fourth module has
been added to the system to improve the resolution of the boundaries. This
module computes appropriate acoustic attributes at a high analysis rate using
different window shapes that depend on the specific context. The boundaries are
then adjusted based on these new attributes.


2.5  Transcription and Alignment

A detailed description of the procedure used for entering the aligned
transcription is provided in Section 2.5.3. For further discussion of this
process, the reader is referred to Appendix C.

2.5.1  The  Acoustic-Phonetic  Label  Set 

The label set used to provide the time-aligned acoustic-phonetic sequence is
intended to represent a level somewhat intermediate between phonemic and
acoustic. Our belief was that clear acoustic boundaries in the waveform should
all be marked, and that the criteria for positioning the boundaries between
units should in part be based on our ability to mark them consistently.

In addition to the phonemes, the set of recognized acoustic-phonetic labels
includes:

•  Stop closures, as stops are characterized by a sequence of a closure and a
release,

•  Stop allophones, the glottal stop [ʔ] and the flap [ɾ], with a separate
flapping decision for /t/ and /d/,

•  Two allophones of /h/: voiced [ɦ] and unvoiced [h], based on an analysis of
the waveform for clear low frequency periodicity,

•  Four diphthongs, /ay/, /oy/, /ey/, and /aw/, each represented as a single
label with no separate region defined for the offglide portion,

•  Two vowel phonemic forms represented by more than one allophone: schwa and
/u/. For /u/, a back [u] and a front [ʉ] allophone are recognized. Schwa has
four separate allophones: back [ə], front [ɨ], retroflex [ɚ], and devoiced [ə̥],

•  Four syllabic consonants: [m̩, n̩, ŋ̍, l̩].

Our  label  set  also  includes  a  category  “epenthetic  silence,”  which  we  use  to  mark 
acoustically  distinct  regions  of  weak  energy  separating  sounds  that  involve  a  change 
in  voicing.  These  short  gaps  are  typically  due  to  articulatory  timing  errors.  The  most 
common occurrences of such gaps are between an /s/ and a following semivowel or
nasal,  as  in  “small”  or  “swift.” 

In general, we tried to label what we heard/saw rather than what we expected.
Thus, if a person said “imput” for “input,” the nasal would be marked as an
/m/. However, in conditions of ambiguity, the underlying phonemic form was
preferentially selected. For further listings and definitions of the label set,
the reader is referred to Appendix C.

2.5.2  Criteria  for  Boundary  Assignments 

Often the boundaries between two acoustic-phonetic units are clear and well
defined. However, there are a number of cases where the exact placement of a
boundary is problematic, or where it is not clear whether a region should be
represented as one or two acoustic-phonetic units. We tried to define a set of
criteria that would be systematic and least prone to human error, in order to
produce boundary placements that were as consistent as possible.

As mentioned previously, we decided that the boundary between the closure
interval and the release of a stop is an important one that should be marked.
It is certainly a very distinct landmark in the waveform. Anyone interested in
studying the burst characteristics of a stop would then be able to focus on
just the region that includes the released portion. In a strictly phonemic
representation, the closure and release would be represented as a single unit,
leaving that critical boundary unmarked.

One type of problematic boundary is the one separating a prevocalic stop from a
following semivowel, as in “truck.” Typically, part of the /r/ is devoiced, and
therefore is absorbed into the aspiration portion of the stop. If listening
were the only criterion, then the left boundary of the /r/ would occur
somewhere in the aspiration, and the right boundary would occur somewhere after
voicing onset. A clear acoustic boundary at the point of voice onset would
remain unmarked. It would also be difficult to decide where to mark the
boundary between the stop burst and the aspirated /r/ portion. Since
voice-onset time (VOT) is a parameter that has been a focus of many research
projects, it seems unsatisfactory not to include a reliable mechanism for
measuring VOT based on the labelled boundaries. Therefore, we adopted the
policy of always absorbing into the stop release all of the unvoiced portion of
a following semivowel.

The right-hand boundary of many prevocalic semivowels and the left-hand
boundary of postvocalic semivowels are both rather ill-defined in the
spectrogram, because the transitions are slow and continuous. It is not
possible to define a single point in time that separates the vowel from the
semivowel. In such cases, we decided to adopt a 1/3 - 2/3 formula, giving the
vowel twice as much duration as the semivowel.
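
As a worked illustration of the 1/3 - 2/3 formula, the sketch below places the boundary inside a combined vowel-plus-semivowel region. It assumes the rule is applied to the whole ambiguous region between its two clear outer boundaries; the function and parameter names are our own.

```python
def vowel_semivowel_boundary(region_start, region_end, semivowel_first):
    """Place the boundary so the vowel gets 2/3 and the semivowel 1/3."""
    span = region_end - region_start
    if semivowel_first:
        # Prevocalic semivowel: semivowel occupies the first third.
        return region_start + span / 3
    # Postvocalic semivowel: vowel occupies the first two thirds.
    return region_start + 2 * span / 3

# A 300 ms region starting at time 0 (times in milliseconds).
print(vowel_semivowel_boundary(0, 300, semivowel_first=True))   # 100.0
print(vowel_semivowel_boundary(0, 300, semivowel_first=False))  # 200.0
```

In both cases the vowel receives 200 ms and the semivowel 100 ms, which is the two-to-one split described above.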

Whenever the same phoneme follows itself (gemination), we did not attempt to
mark a boundary between the two units. This situation occurs exclusively at
word boundaries, as in “some money.” Furthermore, in the case of a stop-stop
sequence where the first stop is unreleased, the closure interval was
identified with the first stop and the release with the second one.

2.5.3  Procedure  for  Entering  the  Aligned  Transcription 

The  labelling  process  involved  three  steps: 

1.  An  acoustic-phonetic  sequence  was  entered  by  hand  as  a  string. 

2.  The speech waveform was aligned automatically with the acoustic-phonetic
sequence, using the system described in Section 2.4.

3.  The  automatically  generated  boundaries  were  hand  corrected. 

In steps 1 and 3, the labeler made use of the displayed spectrogram, spectral
cross sections, the original waveform, and auditory output. This process took
place within the SPIRE software facility for analyzing speech, a powerful
interactive tool that is well matched to this task [3].

The  first  step  required  less  intensive  use  of  the  SPIRE  tool  than  the  third  step, 
because  it  was  only  necessary  to  record  what  was  heard,  without  identifying  the  time 
locations  of  the  events.  The  labels  were  entered  either  by  typing  or  by  mousing  from 
a  displayed  set.  Judgments  were  made  using  the  spectrogram  and  waveform  displays, 
as  well  as  careful  listening. 

Once a phonetic sequence had been provided, a preliminary alignment of the
phonetic symbols with the waveform was performed using CASPAR, as described in
Section 2.4. Any errors in the automatically aligned acoustic-phonetic sequence
were then corrected by hand. Hand correction was based on critical listening to
portions of the utterance as well as visual examination of the spectrogram and
the waveform. The SPIRE layout for this stage is shown in Figure 1. As
illustrated in the figure, the transcription boundaries are overlaid on the
spectrogram for ease of decision-making. The spectrogram covers close to 3
seconds of speech at one time, whereas the waveform is displayed on a much more
expanded time scale. Any subportion of the waveform or of the spectrogram could
be used to define a region to be played out to earphones. In addition, a
phonetically labelled region in the spectrogram could be moused such that only
the portion between the two boundaries was played.

The mouse was used to move existing boundaries to new points in time, to erase
boundaries, or to insert new boundaries. In addition, specified mouse clicks on
any segment allowed the labeler to change the acoustic-phonetic label
associated with that segment. This step was occasionally necessary to correct
an error of judgment made in step 1.

Once the phonetic transcription is aligned, it is rather straightforward to
propagate the alignment up to the orthographic transcription as well as the
intermediate phonemic transcription. A time-aligned orthographic transcription
is useful when searching for a specific word, while a time-aligned phonemic
transcription can be used to relate the lexical representation of words to
their acoustic realizations.
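
In the simplest case, propagating the alignment upward means giving each word the start time of its first phone and the end time of its last. The sketch below shows only that simple case and is not the mapping system cited in this report: it assumes the number of phones belonging to each word is already known, and it ignores complications such as a geminate segment shared across a word boundary.

```python
def word_alignment(phone_segments, words):
    """Derive (start, end, word) spans from aligned phones.

    phone_segments: list of (start, end, label), in time order.
    words: list of (word, number_of_phones), in utterance order.
    """
    aligned = []
    index = 0
    for word, count in words:
        if count <= 0 or index + count > len(phone_segments):
            raise ValueError(f"phone count for {word!r} does not fit")
        start = phone_segments[index][0]
        end = phone_segments[index + count - 1][1]
        aligned.append((start, end, word))
        index += count
    if index != len(phone_segments):
        raise ValueError("some phones were not assigned to any word")
    return aligned

# Toy example with made-up labels and times in milliseconds.
phones = [(0, 40, "t"), (40, 150, "uw"),
          (150, 210, "k"), (210, 320, "ae"), (320, 380, "t"), (380, 470, "s")]
print(word_alignment(phones, [("two", 2), ("cats", 4)]))
# [(0, 150, 'two'), (150, 470, 'cats')]
```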

In  addition  to  the  acoustic-phonetic  alignment  system  described  above,  we  have 
also  developed  a  system  that  maps  a  time-aligned  acoustic-phonetic  transcription  to 
the  phonemic  and  orthographic  transcriptions  [12].  However,  the  alignment  effort  for 
these  transcriptions  lags  somewhat  behind  the  phonetic  alignment.  We  will  provide 
these  transcriptions  in  a  future  release. 


2.6  Database  Management  System  Development 

We have implemented an utterance database management/access system (DBMS)
based on the SEARCH [3] statistical analysis tool. The SPIRE Utterance Database
System (or SUDS) works by “scanning” a user-defined database of utterances. The
scanning procedure is much less time- and memory-intensive than actually
loading the utterances. A SEARCH sample is then built from the database, and
all the power of SEARCH may be used on the problem being examined. Utterance
databases (and their associated samples) may be saved, loaded, and “rescanned”
or revised. The rescanning procedure is faster than scanning, and consists of
updating information about those files that have changed, and then rebuilding
the sample. Rescanning is useful when, for example, a few transcriptions have
been changed in a database of hundreds of utterances.
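
The idea behind rescanning can be illustrated with a small incremental-update sketch. This is not the SUDS implementation: the report does not say how changed files are detected, so the use of modification times, the index layout, and the `scan_file` callback are all assumptions.

```python
def rescan(index, current_mtimes, scan_file):
    """Refresh an utterance index, rescanning only new or changed files.

    index: {filename: (mtime, summary)} from the previous scan.
    current_mtimes: {filename: mtime} as found now.
    scan_file: function(filename) -> summary (the expensive step).
    Returns the updated index and the list of files that were rescanned.
    """
    updated = {}
    rescanned = []
    for name, mtime in current_mtimes.items():
        previous = index.get(name)
        if previous is not None and previous[0] == mtime:
            updated[name] = previous              # unchanged: reuse summary
        else:
            updated[name] = (mtime, scan_file(name))
            rescanned.append(name)
    return updated, rescanned

# Toy example: one file unchanged, one edited, one newly added.
old_index = {"a.utt": (1, "A1"), "b.utt": (1, "B1")}
now = {"a.utt": 1, "b.utt": 2, "c.utt": 1}
new_index, touched = rescan(old_index, now, scan_file=str.upper)
print(touched)  # ['b.utt', 'c.utt']
```

Only the touched files incur the cost of scanning; the sample would then be rebuilt from `new_index`.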

Commands already implemented in SUDS include Load Utterances, Delete
Utterances, and Dump Utterances; the last can be used for making tapes of the
selected utterances. There is also a Save Utterance List command, which can be
used to write
a  file  with  the  filenames  of  the  selected  utterances.  The  file  can  then  be  processed  by 
arbitrary  user  code.  The  entire  system  is  designed  for  easy  extensibility,  and  other 
commands  can  be  added  with  very  little  effort. 


2.7  Database  Distribution 

The acoustic-phonetic database was completely phonetically transcribed,
aligned, and checked as of June 1988. As the sentences were completed, they
were sent to NBS, where they were examined and prepared for distribution. The
distribution is being accomplished in three balanced stages, and two-thirds of
the database has already been released. Currently, the database is available to
the general public on magnetic tape, although plans for compact disc releases
are well under way.

Many  minor  errors  in  the  database  have  been  found  and  corrected,  both  at  MIT 
and  at  NBS,  but  despite  our  best  intentions,  more  errors  undoubtedly  exist.  It  is  our 
intention to continually provide corrections and updates for the foreseeable
future.


ABSTRACT

The need for a comprehensive, standardized speech database is threefold:
first, to acquire acoustic-phonetic knowledge for phonetic recognition; second,
to provide speech for training recognizers; and third, to provide a common test
base for the evaluation of recognizers. There are many factors to consider in
corpus design, making it impossible to provide a complete database for all
potential users. It is possible, however, to provide an acceptable database
that can be extended to meet future needs. After much discussion among several
sites, a consensus was reached that the initial acoustic-phonetic corpus should
consist of calibration sentences, a set of phonetically compact sentences, and
a large number of randomly selected sentences to provide contextual variation.
The database design has been a joint effort including MIT, SRI, and TI. This
paper describes MIT’s role in corpus development and analyses of the phonetic
coverage of the complete database. We also include a description of the
phonetic transcription and alignment procedure.

INTRODUCTION 

The  development  of  a  common  speech  database  is  of 
primary  importance  for  continuous  speech  recognition  ef¬ 
forts.  Such  a  database  is  needed  in  order  to  acquire  acoustic- 
phonetic  knowledge,  develop  acoustic-phonetic  classifica¬ 
tion  algorithms,  and  train  and  evaluate  speech  recognis¬ 
ers.  The  acoustic  realisation  of  phonetic  segments  results 
from  a  multitude  of  factors,  including  the  canonical  char¬ 
acteristics  of  the  phoneme,  contextual  dependencies,  and 
syntactic  and  extralinguistic  factors.  A  large  database  will
make  it  possible  to  examine  in  detail  many  of  these  fac¬ 
tors,  with  the  hope  of  eventually  understanding  acoustic 
variability  well  enough  to  design  robust  speech  recognisers.
A  complete  database  should  include  different  styles
of  speech,  such  as  isolated  words,  sentences  and  para¬ 
graphs  read  aloud,  and  conversational  speech.  The  speech 
samples  should  be  gathered  from  many  speakers  (at  least
several  hundred)  of  varying  ages,  both  male  and  female,
with  a  good  representation  of  the  major  regional  dialects
of  American  English.

DESIGN  CONSIDERATIONS 

There  are  many  factors  to  consider  in  designing  a  large 
corpus  for  speech  analysis.  Unfortunately,  some  of  the 
goals  are  limited  by  practical  considerations.  Ideally  we 
would  like  to  include  multiple  samples  of  all  phonemes  in 
all  contexts,  a  goal  that  is  clearly  impractical  for  a  man¬ 
ageable  database. 

At  the  last  DARPA  review  meeting  it  was  decided  that 
an  initial  acoustic-phonetic  database  would  be  designed 
to  have  good  phonetic  coverage  of  American  English.  It 
was  agreed  that  the  initial  acoustic-phonetic  corpus  would 
include  calibration  sentences  (spoken  by  every  talker),  a 
small  set  of  phonetically  compact  sentences  (each  spoken 
by  several  talkers)  and  a  large  number  of  sentences  (each 
to  be  spoken  by  a  single  talker).  This  combination  was 
chosen  to  balance  the  conflicting  desires  for  compact  pho¬ 
netic  coverage,  contextual  diversity,  and  speaker  variabil¬ 
ity.  Another  requirement  of  the  corpus  was  that  the  sen¬ 
tences  should  be  reasonably  short  and  easy  to  say. 

The  database  design  is  a  joint  effort  between  MIT, 
SRI,  and  TI.  The  speaker  calibration  sentences,  provided 
by  SRI,  were  designed  to  incorporate  phonemes  in  con¬ 
texts  where  significant  dialectical  differences  are  antici¬ 
pated.  They  will  be  spoken  by  all  talkers.  The  second 
set  of  sentences,  the  phonetically  compact  sentences,  was 
hand-designed  by  MIT  with  emphasis  on  as  complete  a 
coverage  of  phonetic  pairs  as  is  practical.  Each  of  these 
sentences  will  be  spoken  by  several  talkers,  in  order  to  pro¬ 
vide  a  feeling  for  speaker  variation.  Since  it  is  extremely 
time-consuming  and  difficult  to  create  sentences  that  are 
both  phonetically  compact  and  complete,  a  third  set  of 
randomly selected sentences, chosen by TI, provides alternate
contexts and multiple occurrences of the same phonetic
sequence in different word sequences.

A breakdown of the actual sentence corpus is shown in
Table 1. This arrangement was chosen to balance the
conflicting desires for capturing inter-speaker variability
and providing contextual diversity. Since the calibration
sentences are spoken by all of the speakers, they should
be useful for defining dialectical differences. For multiple
instances of the exact same phonetic environments, but
with a much richer acoustic-phonetic content than in the
calibration sentences, the MIT set would be appropriate.
The TI sentences, to be spoken by one talker per sentence,
should provide data for phoneme sequences not covered by
the MIT database.

                    No. Talkers   No. Sentences   Total
Calibration (SRI)       640              2         1280
Compact (MIT)             7            450         3150
Random (TI)               1           1800         1800
Total                     -              -         6230

Table 1: Breakdown of Frequencies of Occurrence of Sentences
in Corpus

DESIGN  OF  THE  COMPACT 
ACOUSTIC-PHONETIC  SENTENCES 

A set of 450 sentences was hand-designed at MIT, using
an iterative procedure, to be both compact and comprehensive.
We made no attempt to phonetically balance the sentences.
We used ALexiS and the Merriam-Webster Pocket Dictionary
(Pocket) to interactively create sentences and analyse the
resulting corpus. We began with the 'summer' corpus created
for the MIT speech spectrogram reading course to include
basic phonetic coverage and interesting phonetic environments.
We initially augmented these sentences by looking at pairs
of phonemes, trying to have at least one occurrence of each
phoneme pair sequence. ALexiS was used to search the Pocket
dictionary for words having sequences that were not
represented and for words beginning or ending with a specific
phoneme. We then created sentences using the new words
and added them to the corpus. Certain difficult sequences
were emphasised, such as vowel-vowel and stop-stop
sequences. Some phoneme pairs are impossible; others are
extremely rare and occur only across word boundaries.
For example, /w/ and /y/ never close a syllable, except
as an off-glide to a vowel, so many /w/-phoneme pairs are
impossible. After filling some of the gaps in coverage, we
reanalysed the sentences with regard to phoneme pair
coverage, consonant sequence coverage, and the potential for
applying phonological rules both within words and across
word boundaries. In a final pass through the sentence set,
we modified and enriched sentences where simple
substitutions could introduce variety or generate an instance of
a rare phoneme pair.
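
The pair-coverage check described above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not
the tool actually used at MIT: the function names, the toy corpus,
and the three-symbol inventory are all hypothetical, and every ordered
pair over the inventory is treated as a candidate even though some
pairs are impossible in English.

```python
from itertools import product

def phoneme_pairs(sentence):
    """Return the set of adjacent phoneme pairs in one sentence.

    `sentence` is a list of words, each word a list of phoneme symbols.
    Pairs are collected both within words and across word boundaries.
    """
    phones = [p for word in sentence for p in word]
    return set(zip(phones, phones[1:]))

def coverage(corpus, inventory):
    """Split all ordered pairs over `inventory` into (covered, missing)."""
    seen = set()
    for sentence in corpus:
        seen |= phoneme_pairs(sentence)
    all_pairs = set(product(inventory, repeat=2))
    return seen & all_pairs, all_pairs - seen

# Toy corpus with a made-up three-phoneme inventory (illustration only).
toy_corpus = [[["k", "ae", "t"], ["ae", "t"]]]
covered, missing = coverage(toy_corpus, ["k", "ae", "t"])
print(sorted(covered))
print(len(missing), "pairs still uncovered")
```

The `missing` set plays the role of the list of unrepresented
sequences that would then be searched for in a dictionary.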

ANALYSIS  OF  PHONETIC 
COVERAGE 

This section discusses the phonetic coverage of the compact
sentence set developed at MIT and the resulting corpus
consisting of the combined MIT and TI sentences.
This analysis does not include the calibration sentences
as we consider their use to be of a different nature.


                     Pocket      HL    MIT-450    APDB
# sentences               -     720        450    5040
# unique words       10,837    1804       1702    5107
# words              10,837    5745       3403  41,161
ave # words/sent          -     7.0        7.6     8.2
min # words/sent          -       5          4       2
max # words/sent          -      12         13      10
ave # syls/word       1.38*     1.1       1.58    1.54
ave # phones/word     3.34*    2.07        4.0    3.80

* The ave # syls/word and ave # phones/word have been
weighted by Brown Corpus [1] word frequencies.

Table 2: Description of Databases

Table 2 compares some of the distributional properties
of the Pocket Lexicon (Pocket), the Harvard List (HL) [2],
the MIT-selected sentences (MIT-450), and the Acoustic-
Phonetic Database selected sentences (APDB). The APDB
includes seven copies of each MIT-450 sentence, to account
for the number of talkers per sentence, and a single copy of
each randomly selected sentence (D-1800). Since we were
given only the orthographies for the D-1800 sentences, we
generated phonemic transcriptions by dictionary lookup,
by rule-based expansion of the dictionary entries, and, as
a last resort, by a text-to-speech synthesiser. We expect
that there are pronunciation variations between the
dictionary and the text-to-speech synthesiser, particularly with
respect to vowel color. There may also be some
pronunciation errors, but we think these will be statistically
insignificant.

The proportion of unique words relative to the total
number of words is substantially larger in the MIT-450
than the APDB, probably due to the selection procedure.
We tried to use new words in sentences and to avoid
duplication when at all possible. Roughly 50% of the MIT-450
words are unique, as compared to only 25% of the APDB
words. The D-1800 sentences are, on the average, slightly
longer than those in the MIT-450. The 10 most frequently
occurring words for all of the corpora are function words or
pronouns. In both the MIT-450 and the APDB corpora,
the most common word is "the," accounting for roughly
7% of all words.
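
As a small illustration of how such word statistics can be computed
from sentence orthographies, consider the sketch below. The function
name and the two toy sentences are hypothetical; for the MIT-450, the
counts in Table 2 give 1702 unique words out of 3403, or about 50%.

```python
from collections import Counter

def word_stats(sentences):
    """Return token count, type count, type/token ratio, and top word.

    The top word is reported as (word, share of all tokens).
    """
    tokens = [w.lower() for s in sentences for w in s.split()]
    counts = Counter(tokens)
    top_word, top_count = counts.most_common(1)[0]
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "ratio": len(counts) / len(tokens),
        "top": (top_word, top_count / len(tokens)),
    }

# Two invented sentences, used only to exercise the function.
stats = word_stats(["the cat saw the dog", "the dog ran"])
print(stats)
```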

The  average  numbers  of  syllables  and  phones  per  word 
are  longer  for  the  MIT-450  and  the  APDB  than  for  the 
HL.  This  is  presumably  due  to  the  higher  percentage  of 
polysyllabic  words. 

Figure 1 shows the distribution of the number of syllables
per word for the two corpora. The distributions are
quite similar, with the majority of the words being mono-
or bi-syllabic. The MIT-450 corpus has a slightly higher
percentage of polysyllabic words than does the combined
corpus. We specifically tried to include polysyllabic words
in the sentences, since these are likely to be spoken with
greater variability.

Figure 1: Histograms of the number of syllables per word.

Figure 3: Histograms of the 10 most common phonemes.

Distributions  of  the  number  of  phonemes  per  word  are 
shown  in  Figure  2.  The  10  most  common  phonemes  and 
their  frequency  of  occurrence  are  given  in  Figure  3. 

Table 3 shows the distribution of within-word consonant
sequences for the four databases. The MIT-450 sentence
set covers most of the consonant sequences occurring
within words. The APDB has more complete coverage,
particularly for the word-final and word-medial sequences.
We examined a list of all of the word-initial and word-final
clusters in the sentence list, and compared these with the
occurrences in Pocket. We verified that essentially every
initial cluster that occurred more than once in the Pocket
lexicon was included at least once in the APDB, and that
most of the final clusters were covered. Often, if a word-
final cluster did not occur in word-final position in the
APDB, the sequence did occur within a word or across
a word boundary. Generally, the sequences occurring in
Pocket that are not covered by APDB are from borrowed
words such as 'moire' and 'svelte.'
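
One way to extract word-initial, word-medial, and word-final
consonant sequences from a phonemic transcription is sketched below.
The function name, the phoneme symbols, and the small vowel set are
assumptions made for illustration; they are not the symbol set of the
Pocket lexicon.

```python
def consonant_sequences(word, vowels):
    """Return (initial, medials, final) consonant runs of a phoneme list.

    A run is a maximal stretch of non-vowel phonemes. Empty tuples mean
    the word starts or ends with a vowel.
    """
    runs = []  # each entry: (start index, end index, consonant tuple)
    i = 0
    while i < len(word):
        if word[i] not in vowels:
            j = i
            while j < len(word) and word[j] not in vowels:
                j += 1
            runs.append((i, j, tuple(word[i:j])))
            i = j
        else:
            i += 1
    initial = next((r[2] for r in runs if r[0] == 0), ())
    final = next((r[2] for r in runs if r[1] == len(word)), ())
    medials = [r[2] for r in runs if r[0] > 0 and r[1] < len(word)]
    return initial, medials, final

VOWELS = {"eh", "ax"}  # hypothetical subset, enough for the examples
print(consonant_sequences(["s", "t", "r", "eh", "ng", "k", "th", "s"], VOWELS))
print(consonant_sequences(["eh", "k", "s", "t", "r", "ax"], VOWELS))
```

Collecting these tuples over every word of a lexicon or sentence list
gives the sets whose sizes a table of consonant sequences would report.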

The APDB includes many word-final consonant sequences
that were not present in MIT-450. In fact, there are
more word-final consonant sequences in the APDB than
actually occur in Pocket. The reason is that the Pocket
lexicon does not include suffixes.

Figure 2: Histograms of the number of phonemes per word.

# unique words    10,837   [illegible]   1702
# WI              75   [illegible]   84   88
# WF              120   [illegible]   148
# WM              228   388
# boundaries      2053   [illegible]
# WB              [illegible]   805   1830

Table 3: Distribution of Consonant Sequences

A more detailed phonetic analysis of all phoneme pairs
is included in Appendix 1 in tabular form. The tables are
broken down into phoneme subsets, and data are included
for both the MIT-450 and the APDB. Some of the gaps in
the MIT-450 table have been filled in by sentences in the
TI-1800 corpus (e.g., the syllabic /l/ column of the vowel-
sonorant pairs table and the /y/ column of the vowel-
sonorant pairs table). Note also that some gaps occur in
both tables. Such gaps are expected, since some phoneme
sequences are impossible or quite rare. For example, the
lax vowels (excluding schwa) are never found in syllable-
final position in English. As a consequence, table entries
requiring lax vowels as the first member of a pair have
many gaps (see, for example, the vowel-vowel entries in
the pair tables).

Figure 4 compares histograms of the sentence types for
the MIT-450 and the APDB. Simple sentences (Simple S.)
and questions (Simple Q.) have no major syntactic markers.
Complex sentences (Complex S.) and questions (Complex Q.)
are expected to have a major syntactic boundary
when read. As can be seen, the APDB has a wider variety
of sentence types, with 75% being simple declarative
sentences. In the MIT-450, almost 85% of the sentences
are of the simple declarative form. Questions form about
10% of both corpora.

Figure 4: Histogram of sentence types.

Figure 5 shows counts of environments where major
phonological rules may apply. We chose to gather
information on the following possibilities:

• gemination (GEM)

• vowel-vowel sequences (VVS)

• vowel-schwa sequences (VSS)

• schwa-vowel sequences (SVS)

• flapping of /t/, /d/, and /n/ (FLAP)

• homorganic stop insertion (HSI)

• schwa devoicing (S-DVC)

• fricative devoicing (F-DVC)

• /s/-/š/ and /z/-/š/ palatalisation (PAL)

• y-palatalisation: /dy/ → /ǰ/ (DY-Jh)

• y-palatalisation: /ty/ → /č/ (TY-Ch)

• y-palatalisation: /sy/ → /š/ (SY-Sh)

The histograms show that both corpora have many potential
environments for flapping and homorganic stop insertion.
The vowel-vowel environments are also well covered.
The analysis for phonological rule application is difficult,
because of the difficulties in predicting what different
speakers will say.
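
Counting potential rule environments of this kind amounts to scanning
the phonemic transcription for local patterns. The sketch below counts
only two of them at word boundaries, and the definitions it uses are
assumptions, since the exact rule contexts are not spelled out here:
gemination is taken to mean identical consonants meeting across a word
boundary, and a vowel-vowel environment is a word-final vowel followed
by a word-initial vowel. The symbols and names are hypothetical.

```python
def boundary_environments(sentence, vowels):
    """Count assumed GEM and vowel-vowel environments at word boundaries.

    `sentence` is a list of words, each a non-empty list of phonemes.
    """
    counts = {"GEM": 0, "VV": 0}
    for left, right in zip(sentence, sentence[1:]):
        last, first = left[-1], right[0]
        if last == first and last not in vowels:
            counts["GEM"] += 1  # same consonant on both sides of the boundary
        if last in vowels and first in vowels:
            counts["VV"] += 1   # vowel meets vowel across the boundary
    return counts

# "big go over" in a made-up phoneme notation.
toy = [["b", "ih", "g"], ["g", "ow"], ["ow", "v", "er"]]
env = boundary_environments(toy, vowels={"ih", "ow", "er"})
print(env)
```

These are counts of places where a rule could apply, not of places
where a given speaker actually applies it.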

RECORDING,  LABELING,  AND 
ALIGNMENT 

The recording of the sentences is currently under way at
TI. Speech is recorded digitally at 20 kHz, simultaneously
on a pressure-sensitive microphone and on a Sennheiser
close-talking microphone. Digital tapes are shipped to
NBS, where they are filtered and downsampled to 16 kHz.
The resampled tapes are then shipped to MIT, where the
orthographic and phonetic transcriptions are generated.

Figure 5: Histogram for potential application of phonological rules.

Figure 6: Phones used for labeling. The classes shown are
unvoiced stops, voiced stops, stop gaps, nasals, syllabic
nasals, unvoiced fricatives, voiced fricatives, glides,
vowels, schwa, and H and silences.

Transcriptions are generated using the Spire facility,
in conjunction with the automatic alignment system
provided by Leung [3]. The transcription process involves
three steps:

1. A "Phonetic Sequence," which consists of a list of
the phones of the utterance in correct temporal order
but with no boundaries marked in time, is entered.

2. The utterance is run through an automatic system
to generate an alignment for the sequence.

3. The automatically generated alignment is hand-
corrected.

Only the data recorded through the pressure microphone
are transcribed. Transcriptions for the close-talking
version  are  generated  by  duplicating  the  results  for  the 
pressure  microphone. 

The phones used in the labeling are shown in Figure
6. In many cases, it is not possible to define a boundary
between two phones, such as /or/, because features
appropriate for both phones often occur simultaneously in time.
When no obvious positioning of the boundary is apparent,
arbitrary rules, such as an automatic 2/3:1/3 split, are
invoked. There are also some cases in which none of our
standard phones are appropriate for a given portion of the
speech, primarily because of severe coarticulation effects.
In such cases, the segment is labeled as the nearest phone
equivalent, according to the transcriber's judgment. There
are other difficult cases, such as syllable-initial /pl/, where
the /l/ is devoiced at onset. Should the portion before
voicing begins be thought of as part of the aspiration of
the /p/, or as part of the /l/? We have decided, somewhat
arbitrarily, to define the onset time of the phone following
an unvoiced stop as coincident with the onset of voicing.
These remarks serve simply as examples of some of the
difficulties that arise in transcribing continuous speech. We
are mainly interested in using consistent methods of
transcribing in situations where ambiguity exists. Currently
the transcription rate is 100 sentences per week.

SUMMARY 

We have described various components of the preliminary
acoustic-phonetic database and discussed some of the
issues in its design. Evaluating the phonetic coverage of
the database is difficult primarily because no standard
for comparison exists. We have chosen to compare
the phonetic coverage of the database to two well-known
sources, the Merriam-Webster Pocket Dictionary of 1964
and the Harvard List sentences. The dictionary does not
reflect spoken English very well, and can only guide us
in judging the possible phonemic sequences within words.
The Harvard List sentences, while phonemically balanced,
consist primarily of very simplistic sentences and
monosyllabic words. In addition, they are balanced for phoneme
occurrences, whereas we tried to account for occurrences
of phoneme pairs.

We  believe  that  we  have  adequate  coverage  of  most 
phonemes  and  phoneme  pairs.  In  cases  where  the  phoneme 
pairs  are  scarce,  there  are  often  other  phoneme  pairs  that 
will  provide  similar  information.  For  example,  the  class 
sequence  [alveolar  consonant]  [back  vowel]  is  more 
general  than  /t/  /»/,  and  has  a  higher  frequency  of  occur¬ 
rence. 

We  hope  that  the  APDB  database  will  provide  guide¬ 
lines  for  the  development  of  future  databases.  An  analy¬ 
sis  of  the  spoken  corpus  will  enable  us  to  judge  our  pho¬ 
netic  analysis  procedure.  In  particular,  we  will  be  able 
to  evaluate  the  relationship  between  our  phonological  rule 
predictions  and  the  frequency  with  which  a  phonological
modification  actually  occurred.

ABSTRACT

A system for automatic alignment of phonetic transcriptions with continuous
speech has been developed. The speech signal is first segmented into broad
classes using a non-parametric pattern classifier. A knowledge-based dynamic
programming algorithm then aligns the broad classes with the phonetic
transcription. These broad classes provide "islands of reliability" for more
detailed segmentation and refinement of boundaries. By doing alignment at
the phonetic level, the system can often tolerate inter- and intra-speaker
variability. The system was evaluated on sixty sentences spoken by three
speakers, two male and one female. 93% of the segments are mapped into only
one phoneme. 70% of the time the offset between the boundary found by the
automatic alignment system and a hand transcriber is less than 10 ms. The
performance can be improved by applying more heuristic rules.


INTRODUCTION 

The alignment of a speech signal with its corresponding phonetic
transcription is an essential process in speech research, since the
time-aligned transcription can serve as pointers to specific phonetic
events in the waveform. If a sufficient amount of time-aligned acoustic
data is available, speech researchers will then be able to quantify the
properties of phonetic segments and describe how their characteristics
are modified by contexts. These results in turn will lead to a better
model for speech production, as well as better rules for speech
synthesis and recognition.

Traditionally, the alignment is done manually by a trained acoustic
phonetician, who listens to the speech signal and visually examines
various displays of the signal. There are several disadvantages to this
approach. First, the task is extremely time consuming; even under the
best of circumstances, the process of time alignment can take several
minutes for one second of speech material. Second, the task requires
the skill and knowledge possessed by a small number of experts. These
two reasons combine to severely limit the amount of data that can be
collected in this manner. Third, there is the lack of consistency and
reproducibility of the results. Manual labeling often involves decisions
that are highly subjective. Even if the sentence and the transcription
were the same, the inter- and intra-transcriber variability can still be
quite high. Finally, there is the problem of human error associated
with tedious tasks.

The problems associated with manual labeling, together with the
need for a large corpus of time-aligned data, clearly call for the
development of an automatic time-alignment system. From the
practical standpoint of developing a phonetically-based speech
recognition system, automatic time alignment will not only help
enhance our basic acoustic-phonetic knowledge, but also provide a
testbed for specific recognition algorithms. In other words, knowing
what the phonetic strings are should make it easier for us to find the
phonetic segments.

Over the past few years, several automatic time alignment procedures
have been suggested in the literature. Most of these approaches
attempt to align the speech waveform with a reference waveform, using
dynamic programming algorithms. The reference waveform may be a
known and previously labeled utterance (3,4), a concatenation of stored
templates (5), or a synthetically generated utterance (6). In order for
these methods to be effective, the two waveforms must not differ
significantly in detailed phonetic structures, or the synthesis rules must
be fairly advanced. A second approach, which also uses dynamic
programming, is to segment and label the waveform into broad
phonetic classes prior to time alignment (7). A more detailed frame-by-
frame labeling is then achieved by a second dynamic programming
algorithm, using derivatives of energy and formant functions.

This paper describes a new method of automatic phonetic alignment.
This method utilizes a standard pattern classification algorithm, a
dynamic programming algorithm, and the constraints imposed by our
acoustic-phonetic knowledge. The speech signal is first segmented into
broad phonetic classes using a non-parametric pattern classifier. The
resulting string is then aligned with the transcription using a
knowledge-based dynamic programming algorithm. Acoustic-phonetic
knowledge is utilized extensively in the feature extraction for pattern
classification, the specification of constraints for time-aligned paths,
and the subsequent segmentation/labeling and refinement of
boundaries.

SYSTEM  DESCRIPTION 

The  basic  structure  of  the  system  that  we  have  developed  is  shown  in 
Figure  1.  The  speech  signal  is  digitized  at  16  kHz  and  captured  by  an 
automatic  end-point  detection  algorithm  (1).  From  the  speech  signal,  a 
number  of  parameters  are  computed  once  every  5  ms.  These 
parameters  are  then  used  in  conjunction  with  a  pattern  classifier  to 
produce  6  broad  phonetic  classes.  The  output  of  the  classifier  is  used  to
time-align  major  and  robust  phonetic  events  with  the  transcription  in  a 
way  similar  to  that  proposed  by  Wagner  (7).  This  initial  time  alignment 
serves  as  anchor  points  for  subsequent  detailed  phonetic  alignment 
utilizing  a  set  of  heuristic  rules. 

Initial Broad Classification

Ideally, one would like to directly classify the speech signal into
segments that correspond to detailed phonetic events. However, the
task is difficult due to the high degree of acoustic variability in the
speech signal. Our approach is to make an initial broad classification
relying on traditional statistical pattern techniques. The objective is to
determine robust acoustic-phonetic events that are relatively context-
independent, and to use these events as anchor points for more detailed
analysis. We have chosen to structure the classifier as a sequence of
binary classifiers arranged in a binary decision tree. One possible
advantage of using a sequence of classifiers is that a different feature
vector can be used for each classifier in order to maximize the contrasts
between the two possible output classes. For example, zero-crossing
rate is helpful for distinguishing sonorants and obstruents, but not for
distinguishing vowels from other voiced consonants. Thus the problem
of classifying the speech signal into different classes can be reduced to a
sequence of sub-problems, which are relatively easier to tackle.
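
To make the zero-crossing example concrete, the sketch below counts
sign changes in consecutive frames. It is an illustration only: the
analysis window used in the system is not stated here, so the code
assumes non-overlapping frames whose length equals the 5 ms frame
step (80 samples at 16 kHz), and the function name is hypothetical.

```python
def zero_crossing_rate(samples, frame_len):
    """Count sign changes inside consecutive non-overlapping frames.

    Crossings that straddle a frame boundary are not counted, and a
    trailing partial frame is dropped.
    """
    rates = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        rates.append(crossings)
    return rates

FRAME_LEN = 80  # 5 ms at a 16 kHz sampling rate (assumed window length)
noisy = [1, -1] * 80   # alternating signal, obstruent-like
steady = [1] * 160     # no sign changes at all
print(zero_crossing_rate(noisy, FRAME_LEN))
print(zero_crossing_rate(steady, FRAME_LEN))
```

A rapidly alternating signal yields high counts in every frame, while
a slowly varying one yields counts near zero.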

At each node in the decision tree, a binary decision is made by a
pattern classification machine as shown in Figure 2. The structure of
each of the classifiers is identical; the only difference is the feature
vectors and initial seed points used in the clustering algorithm. These
classifiers have no knowledge about the transcriptions, make no
assumptions about the distributions of the feature parameters, and need
no training. Each classifier starts with a set of M parameters selected
based on acoustic-phonetic knowledge and computed once every 5 ms.
Note that the number of parameters used, M, may be different for each
of the binary classifiers. The parameters are then processed through a
seven-point median smoother, clipped, and normalized. Clipping is
intended to emphasize the portions of the speech signal where
boundaries are likely to occur. Clipping thresholds are selected
conservatively such that segment boundaries fall within the transitional
regions. Normalization then transforms each of the clipped feature
parameters to the same scale. The clipping and normalization
procedure effectively assigns different weights to different feature
parameters depending on how much the feature parameter
distributions of the two classes overlap.
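
The smoothing, clipping, and normalization chain can be sketched as
follows. Several details are assumptions made for illustration: the
median window is shortened at the ends of the track, the clipping
thresholds `low` and `high` are supplied by hand, and normalization
maps the clipped range linearly onto [0, 1].

```python
from statistics import median

def median_smooth(track, width=7):
    """Apply a running median; the window shrinks at the track edges."""
    half = width // 2
    out = []
    for i in range(len(track)):
        lo = max(0, i - half)
        hi = min(len(track), i + half + 1)
        out.append(median(track[lo:hi]))
    return out

def clip_and_normalize(track, low, high):
    """Clip values to [low, high], then rescale linearly to [0, 1]."""
    clipped = [min(max(x, low), high) for x in track]
    return [(x - low) / (high - low) for x in clipped]

# A one-frame spike is removed by the seven-point median smoother.
smoothed = median_smooth([0, 0, 0, 9, 0, 0, 0, 0])
scaled = clip_and_normalize([-5, 0, 5, 10, 20], low=0, high=10)
print(smoothed)
print(scaled)
```

After clipping, all values beyond a threshold become identical, so
only the transitional range between the thresholds contributes to
distances in the feature space.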

Every 5 ms, an M-dimensional feature vector is obtained. All the
feature vectors for a given sentence constitute samples in the feature
space. A binary decision is made in the M-dimensional feature space
by means of K-Means clustering, using a Euclidean distance metric (2).
It is well known that the location of the cluster centers and the speed of
convergence for a clustering algorithm depend on the choice of the
initial seed points, the number of clusters, and the geometrical
distributions of the data. The use of a binary classifier has the
advantage that the algorithm is guaranteed to converge. In addition,
the binary classifier enables us to apply our acoustic-phonetic
knowledge and select initial seed points at the extrema of the feature
space to maximize the contrast. We found that the algorithm typically
converges after fewer than 10 iterations.
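
A minimal two-cluster K-Means with hand-picked seed points might look
like the sketch below. It is a generic textbook version under stated
assumptions, not the authors' implementation: the stopping rule
(labels unchanged), the iteration cap, and the handling of an empty
cluster are choices made here, and the toy feature vectors are
invented.

```python
def two_means(vectors, seed_a, seed_b, max_iter=50):
    """Cluster vectors into two groups using Euclidean K-Means.

    Returns (labels, centers); label 0 is the cluster grown from seed_a.
    """
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    def mean(group, fallback):
        if not group:  # keep the old center if a cluster empties out
            return fallback
        return tuple(sum(col) / len(group) for col in zip(*group))

    centers = [tuple(seed_a), tuple(seed_b)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if dist2(v, centers[0]) <= dist2(v, centers[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            centers[k] = mean(members, centers[k])
    return labels, centers

# Normalized 2-D features, with seeds at opposite corners of the space.
frames = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
labels, centers = two_means(frames, seed_a=(0.0, 0.0), seed_b=(1.0, 1.0))
print(labels, centers)
```

Placing the seeds at the extrema of the normalized feature space
fixes which cluster corresponds to which class, so no labeled
training data are needed to name the two outputs.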

At the top of the decision tree, the clustering algorithm assigns one of
two labels to every frame of data. Each group of data will pass through
a different classifier at a lower node, and the process repeats. Our
experience has shown that the broad phonetic classifier performs very
well if the total number of classes is small, say 6 or 7. The performance
of the classifier degrades substantially when one attempts to use it to
make fine phonetic distinctions. In our implementation, the classifier
assigns one of six labels to every frame of the data: S (vowel-like
sonorant), O (obstruent), Vo (voiced-obstruent), Si (silence), B (nasals
and voice-bars), and Ul (unlabeled segments). Ul is assigned to
segments with clear evidence of energy dip in the vowel-like sonorant
regions. A context-dependent median smoother is used to remove
spurious segments, although it is rarely needed.
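
Turning per-frame labels into segments can be done with a simple
run-length grouping, sketched below. How the system itself forms
segments is not described here, so this is an assumed, minimal
approach; the function name is hypothetical and the 5 ms frame step
is taken from the text.

```python
from itertools import groupby

def frames_to_segments(labels, frame_ms=5):
    """Merge runs of identical frame labels into (label, start, end).

    Start and end times are in milliseconds, assuming one label per
    frame and a fixed frame step.
    """
    segments, t = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, t * frame_ms, (t + n) * frame_ms))
        t += n
    return segments

# Six frames: silence, a vowel-like sonorant, then an obstruent.
segs = frames_to_segments(["Si", "Si", "S", "S", "S", "O"])
print(segs)
```

Very short runs in this output are the kind of spurious segments that
a smoothing pass over the label sequence would remove.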

Figure 3 shows the spectrogram and waveform of the sentence, "A
tusk is used to make costly gifts", spoken by a male speaker. The
output of the broad phonetic classifier is shown in row (a). The vertical
lines drawn in the spectrogram are at segment boundaries determined
by the classifier.

Alignment 

The  output  of  the  initial  classifier  is  a  broad,  but  presumably  robust, 
description  of  the  significant  acoustic  phonetic  events  in  the  speech 
signal.  In  order  to  use  this  broad  phonetic  description  as  anchor  points 
for  more  detailed  analyses,  the  broad  representauon  must  now  be 
aligned  with  the  phonetic  transcription.  This  is  essentially  a  path 
searching  problem,  and  we  have  chosen  to  use  dynamic  programming, 
where  the  path  is  heavily  constrained  by  acoustic  phonetic  rules. 
Figure  4  illustrates  how  this  is  done  for  the  same  utterance  as  shown 
previously.  The  horizontal  dimension  represents  the  output  of  the 
classifier,  while  the  vertical  dimension  represents  the  actual 
transcription.  Durational  information  is  used  by  (he  algorithm,  but  is 
not  explicitly  represented  in  this  figure.  Two  kinds  of  constraints  direct 
the  algorithm  to  search  for  the  correct  path.  First,  the  path  is  not 
allowed  to  traverse  through  certain  cells,  since  this  will  produce 
unplausible  phonetic  alignments.  These  mismatches  are  stored  as  a  set 
of  context-independent  rules,  and  the  resulting  cells  are  marked  in  the 
figure  by  an  x.  For  example,  the  first  phoneme  A/  is  not  allowed  to 
match  a  silence  or  a  sonorant  segment.  Second,  there  is  a  set  of  rules 
that  eliminates  certain  matches  based  on  contextual  information.  These 
cells  are  marked  in  the  figure  with  an  open  circle.  For  example,  the 
first  /t/  is  not  allowed  to  match  the  second  obstruent  due  to  a 
durational constraint. The filled circles denote the optimum path,
subject to a predetermined set of cost functions. As can be seen from
this example, the acoustic-phonetic constraints can often reduce the
number of possible paths dramatically.
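
As a rough illustration of the constrained path search described above, the following Python sketch aligns a phoneme string with a sequence of broad-class segments by dynamic programming. It is only a sketch under stated assumptions: the compatibility table `ALLOWED`, the step costs, and the function name `align` are hypothetical, and the context-dependent and durational rules of the actual system are not modelled. Cells whose phoneme and broad class are incompatible play the role of the cells marked with an x in Figure 4.

```python
import math

# Hypothetical context-independent rules: broad classes each phoneme may match.
ALLOWED = {
    "t": {"obstruent"},
    "ah": {"sonorant"},
    "s": {"obstruent"},
    "k": {"obstruent", "silence"},
}


def align(phonemes, segments, allowed=ALLOWED, step_cost=(0.0, 1.0, 1.0)):
    """Return the minimum-cost monotonic path as (phoneme_idx, segment_idx) cells.

    step_cost = (diagonal, horizontal, vertical): a horizontal step assigns one
    more segment to the same phoneme, a vertical step assigns one more phoneme
    to the same segment. Returns None if every path crosses a forbidden cell.
    """
    n_ph, n_seg = len(phonemes), len(segments)
    diag, horiz, vert = step_cost
    cost = [[math.inf] * n_seg for _ in range(n_ph)]
    back = [[None] * n_seg for _ in range(n_ph)]

    def permitted(i, j):
        return segments[j] in allowed.get(phonemes[i], ())

    if permitted(0, 0):
        cost[0][0] = 0.0
    for i in range(n_ph):
        for j in range(n_seg):
            if (i == 0 and j == 0) or not permitted(i, j):
                continue  # forbidden cells keep an infinite cost
            candidates = []
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1] + diag, (i - 1, j - 1)))
            if j > 0:
                candidates.append((cost[i][j - 1] + horiz, (i, j - 1)))
            if i > 0:
                candidates.append((cost[i - 1][j] + vert, (i - 1, j)))
            if candidates:
                cost[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    if math.isinf(cost[n_ph - 1][n_seg - 1]):
        return None
    path, cell = [], (n_ph - 1, n_seg - 1)
    while cell is not None:  # trace back from the final cell to (0, 0)
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return path[::-1]


phonemes = ["t", "ah", "s", "k"]
segments = ["obstruent", "sonorant", "obstruent", "silence", "obstruent"]
path = align(phonemes, segments)
```

In this toy example the final /k/ is aligned with two segments (a silence followed by an obstruent), the kind of many-to-one match discussed in the text.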

Row (b) of Figure 3 shows the results of the time alignment. This
sentence contains an example where two segments were aligned with
one phoneme at time t = 1.85 sec. It also contains three examples where
two phonemes were aligned with one segment, at t = 1.0 sec, t = 1.45 sec,
and t = 1.55 sec. In these latter cases, further segmentation is clearly
needed.

Knowledge-based Segmentation

The  dynamic  programming  algorithm  described  previously  divides 
the  speech  signal  into  a  sequence  of  segments.  Each  segment  is  mapped 
into one or more phonetic symbols or events. No further processing is
needed if the matching is one or more segments to one phoneme. For
those segments which correspond to two or more phonetic events, further
segmentation is achieved by applying a set of heuristic rules. The
transition between some phonetic events is gradual and is not marked
by any distinct acoustic cues. In these cases, we have chosen to mark
the boundary by using a set of ad hoc, but consistent, rules. For
example, pre- and post-vocalic liquids next to certain vowels are
assumed to have a duration that constitutes one-third of the syllable
nucleus. Row (c) of Figure 3 shows the results after the application of
such ad hoc rules. Note that the pre-vocalic /l/ at time t = 1.35 sec has
been delineated from the following vowel.

The  transitions  between  some  phonetic  events  are  pronounced  and 
are marked by clear acoustic cues. In these cases, further segmentation
is  accomplished  by  a  proper  selection  of  feature  parameters  and 
algorithms  based  on  contextual  information.  Row  (d)  in 
Figure  3  contains  two  examples,  at  time  t=  1  sec.  and  t=  1.43  sec.  The 
output  of  the  feature-based  segmentation  compares  favorably  with  the 
hand  transcription  shown  in  row  (e). 

Some  segments  are  mapped  into  multiple  phonemes.  Two  examples 
are shown in Figure 5. We are continuously adding and refining the
phonetic  rules.  With  the  proper  rules  and  features,  it  is  hoped  that 
these  problems  will  ultimately  be  solved. 

EVALUATION 

The automatic transcription alignment system described in the
previous section was evaluated using sixty sentences spoken by three
speakers, two male and one female, representing approximately 150 sec
of speech. The sentences were selected from the Harvard list of
phonetically balanced sentences. All three speakers read the same
twenty sentences. All sentences were transcribed and manually aligned
by an experienced transcriber. There are approximately 1800 phonetic
events in the transcription. For comparison, five of the sixty sentences,
selected at random, were manually labeled by a second transcriber. The
entire process of manual labeling took upwards of 15 hours.

Figure 6 summarizes the number of phonetic events that match
one segment after two different stages of processing. Approximately
80% of the time there is a one-to-one correspondence after time
alignment, whereas the final result is over 90%. In other words, only
7% of the segments require further segmentation. This segmentation
can presumably be accomplished as more rules and features are used.
Alternatively, these segments can be corrected by manual intervention.

Figure 7 (a) shows the cumulative distribution of the boundary
offsets between the automatic alignment system and the first transcriber
for the sixty sentences. Approximately 75% of the time this offset is less
than 10 msec. Figure 7 (b) shows the boundary offsets between the two
transcribers for five of the sixty sentences. In this case approximately
80% of the time the offset is less than 10 msec.
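
The agreement figures quoted above are points on a cumulative distribution of absolute boundary offsets. A minimal sketch of that computation follows, assuming the two sets of boundaries have already been paired one-to-one; the function names and the sample times are hypothetical and are not taken from the evaluation data.

```python
def fraction_within(auto_times, ref_times, tolerance_s=0.010):
    """Fraction of paired boundaries whose absolute offset is below tolerance_s."""
    if not auto_times or len(auto_times) != len(ref_times):
        raise ValueError("boundary lists must be non-empty and paired one-to-one")
    offsets = [abs(a - r) for a, r in zip(auto_times, ref_times)]
    return sum(offset < tolerance_s for offset in offsets) / len(offsets)


def cumulative_distribution(auto_times, ref_times, thresholds_s):
    """Evaluate the cumulative offset distribution at each threshold (seconds)."""
    return [(t, fraction_within(auto_times, ref_times, t)) for t in thresholds_s]


# Made-up boundary times in seconds, for illustration only.
auto = [0.100, 0.205, 0.330, 0.452]
ref = [0.102, 0.200, 0.300, 0.450]
within_10_ms = fraction_within(auto, ref)
curve = cumulative_distribution(auto, ref, [0.010, 0.050])
```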

SUMMARY 

In  this  paper  we  described  a  system  that  automatically  aligns  a 
phonetic  transcription  with  the  corresponding  speech  waveform.  The 
system performs initial classification by a pattern classification
algorithm.  The  output  of  the  classifier  is  used  to  determine  "islands  of 
reliability"  for  further  segmentation.  Acoustic  phonetic  knowledge  is 
used extensively during classification, time-alignment, and feature-based
segmentation. We are encouraged by the preliminary results, and
are  hopeful  that  this  system  will  play  a  major  role  in  establishing  a  large 
database  for  speech  research. 
ABSTRACT

The  TIMIT  acoustic-phonetic  database  was  designed  jointly  by  researchers  at  MIT,  TI, 
and  SRI.  It  was  intended  to  provide  a  rich  collection  of  acoustic  phonetic  and  phonological 
data,  to  be  used  for  basic  research  as  well  as  the  development  and  evaluation  of  speech 
recognition  systems.  The  database  consists  of  a  total  of  6,300  sentences  from  630  speakers, 
representing over 5 hours of speech material, and was recorded by researchers at TI. This
paper describes the transcription and alignment of the TIMIT database, which was performed
at  MIT. 

1  BACKGROUND 

When  the  DARPA  Strategic  Computing  speech  program  was  first  formulated  in  1984, 
the consensus of the research community was that the amount of speech data available was
woefully inadequate. As a result, a significant effort on database development was mounted
in  order  to  provide  the  research  community  with  a  large  body  of  acoustic  data  for  research, 
system  development,  and  performance  evaluation.  One  such  database  is  the  so-called  TIMIT 
acoustic-phonetic  database.  The  TIMIT  database  was  designed  jointly  by  researchers  at 
MIT,  TI,  and  SRI.  It  consists  of  a  total  of  6,300  sentences  from  630  speakers,  representing 
over  5  hours  of  speech  material,  and  was  recorded  by  researchers  at  TI.  This  paper  describes 
the  transcription  and  alignment  of  the  TIMIT  database,  which  was  performed  by  researchers 
at  MIT. 

Each speaker in the TIMIT database recorded 10 sentences drawn from three different
corpora, as follows. Each speaker read two sentences, designated as S1 and S2, which were
designed  by  Jared  Bernstein  of  SRI  in  order  to  compare  dialectal  and  phonological  variations 
across  speakers.  Five  sentences,  designated  as  SX  sentences,  were  drawn  from  a  small  set  of 
sentences  designed  at  MIT.  The  remaining  three  sentences  for  each  speaker,  designated  as 
SI  sentences,  were  selected  from  the  Brown  corpus  by  Bill  Fisher  of  TI  [1]. 

There are a total of 450 “MIT” sentences used in the TIMIT database. These were
generated by hand in an iterative fashion, with the goal that they should be phonetically rich.
Care was taken to have as complete a coverage of left and right context for each phone as
possible. Some of the more problematic sequences, such as vowel-vowel and stop-stop, were
particularly emphasized. An attempt was also made to ensure that many of the frequently
occurring low-level phonological rules were adequately represented. To aid in the sentence
generation process, we made use of an on-line Webster’s Pocket Dictionary containing nearly
20,000 words. Words or word sequences containing particular phone pairs could be accessed
from this dictionary automatically, which greatly facilitated the database design process. We
performed a detailed analysis of the resulting sentence set, as well as the SI sentences that
make up the remainder of the database. The interested reader should consult Lamel et al.
[3] for further information about the corpora.
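
The kind of dictionary lookup described above, retrieving words that contain a particular phone pair, can be sketched as follows. The tiny lexicon, its phone labels, and the function name `words_with_pair` are hypothetical stand-ins; the on-line dictionary and the access tools actually used are not described in this report.

```python
# Hypothetical toy lexicon: word -> phone sequence (labels are illustrative).
LEXICON = {
    "reaction": ["r", "iy", "ae", "k", "sh", "ax", "n"],
    "doctor": ["d", "aa", "k", "t", "er"],
    "window": ["w", "ih", "n", "d", "ow"],
    "create": ["k", "r", "iy", "ey", "t"],
}

VOWELS = {"iy", "ae", "aa", "ih", "ow", "ey", "ax", "er"}
STOPS = {"p", "t", "k", "b", "d", "g"}


def words_with_pair(lexicon, first_class, second_class):
    """List (word, left, right) for every adjacent phone pair in the given classes."""
    hits = []
    for word, phones in lexicon.items():
        for left, right in zip(phones, phones[1:]):
            if left in first_class and right in second_class:
                hits.append((word, left, right))
    return hits


vowel_vowel = words_with_pair(LEXICON, VOWELS, VOWELS)
stop_stop = words_with_pair(LEXICON, STOPS, STOPS)
```

Such a search returns candidate words for the vowel-vowel and stop-stop sequences mentioned in the text, from which sentences can then be composed by hand.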

2  THE  ACOUSTIC  PHONETIC  LABEL  SET 

All  of  the  recorded  sentences  were  provided  with  a  time-aligned  sequence  of  acoustic- 
phonetic  labels.  The  label  set  is  intended  to  represent  a  level  somewhat  intermediate  between 
phonemic  and  acoustic.  Our  motivation  was  that  clear  acoustic  boundaries  in  the  waveform 
should  all  be  marked,  and  that  the  criteria  for  positioning  the  boundaries  between  units 
should  in  part  be  based  on  our  ability  to  mark  them  consistently.  Table  1  lists  all  of  the 
acoustic-phonetic  labels  that  were  used.  Most  of  these  labels  are  phonemic,  although  several 
symbols  have  been  included  for  labelling  acoustically  distinct  allophones  as  well  as  other 
special  acoustic  events. 

2.1  Stops 

Stops  are  characterized  by  a  sequence  of  two  events:  a  closure  and  a  release.  This 
departure  from  phonemic  form  is,  we  believe,  important  in  order  to  preserve  a  boundary 
marking the onset of the release. There are six closure symbols for the stops. The closure
region for affricates is identical with that of the corresponding alveolar stop (e.g., the /č/ in
“chat” is represented as [t° č]).
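
The closure-plus-release convention can be illustrated with a short sketch that expands a phonemic string into acoustic-phonetic labels. The ASCII label names used here (`tcl`, `ch`, and so on) are an assumption relative to this text, which uses phonetic symbols, and the sketch assumes that every stop is released, which is not always the case in real speech.

```python
# Assumed ASCII names: one closure label for each of the six stops.
CLOSURE = {"b": "bcl", "d": "dcl", "g": "gcl", "p": "pcl", "t": "tcl", "k": "kcl"}
# Affricates share the closure of the corresponding alveolar stop.
AFFRICATE_CLOSURE = {"ch": "tcl", "jh": "dcl"}


def expand_stops(phonemes):
    """Replace each stop or affricate by a closure label followed by its release."""
    labels = []
    for ph in phonemes:
        if ph in CLOSURE:
            labels += [CLOSURE[ph], ph]
        elif ph in AFFRICATE_CLOSURE:
            labels += [AFFRICATE_CLOSURE[ph], ph]
        else:
            labels.append(ph)
    return labels


chat = expand_stops(["ch", "ae", "t"])
```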

There are two major allophones for the stops. The glottal stop, [ʔ], is often inserted
preceding a word-initial vowel. Sometimes a /t/ can also be realized as a glottal stop, as
in “cotton”. The symbol [ɾ] is used to label a flap, which can be either an underlying /t/
or /d/. We make a separate flapping decision for every phonemic /t/ and /d/, based on
listening and the spectrographic evidence. We allow flapping to occur in environments where
theory predicts it should not, if in fact we believe that a flap is what was heard or seen.

2.2  Nasals  and  Semivowels 

We recognize four allophones for the nasals. Three of them are the syllabic nasals, [m̩ n̩ ŋ̩]. If
there is any evidence of a preceding schwa, the non-syllabic form is preferred. The alveolar
nasal, /n/, can be realized as a nasal flap, denoted by the symbol [ɾ̃]. Sometimes an underlying
/nt/ sequence is realized as a nasal flap, as in “entertain.”

The liquid, /l/, has a syllabic allophone, denoted as [l̩]. Again, a non-syllabic form is
preferred whenever a preceding schwa is observed.

2.3  Vowels 

Two vowels, /iʸ oʷ/, are represented by symbols that include their corresponding
off-glides. This is because they are usually realized as diphthongs in American English. The
four diphthongs, /aʸ/, /aʷ/, /oʸ/, and /eʸ/, are each represented as a single label, with no
separate region defined for the off-glide portion. The retroflexed vowel /ɝ/ is also represented
as a single unit. This represents a departure from the International Phonetic Alphabet, which
would represent this steady-state vowel as the sequence /ər/.

Reduced vowels are represented by four separate allophones: back schwa ([ə]), front schwa
([ɨ]), retroflexed schwa ([ɚ]), and voiceless schwa ([ə̥]). The decision for [ə] vs [ɨ] is based on
whether the second formant is closer to the first or to the third. A low third formant leads
to [ɚ]. Schwas can often be devoiced in words such as “secure”.

English does not distinguish phonemically between the fronted vowel /ü/ and the
standard back /u/; however, the difference in F2 for the two forms can be as much as 800 Hz.
We felt it was unsatisfactory to group two forms with such diverse formant frequencies into
the same vowel category. The decision is made as for schwa: if F2 is closer to F1, it’s
considered a back /u/. Similar trends of fronting are also observed for /o/ and /ʊ/ in certain
environments; however, the effect is most dramatic for /u/.
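
The formant-proximity criterion used for the schwas and for /u/ can be written down as a small decision rule. The sketch below is an illustration under assumptions: distance is measured in linear Hz, the cutoff `low_f3_hz` for a "low" third formant is a hypothetical value, and the check for retroflexion is applied first. In practice the decision was made by a transcriber inspecting the spectrogram, not by a program.

```python
def f2_closer_to_f1(f1, f2, f3):
    """True if the second formant lies closer to the first than to the third."""
    return abs(f2 - f1) < abs(f3 - f2)


def classify_schwa(f1, f2, f3, low_f3_hz=2000.0):
    """Label a voiced schwa as retroflexed, back, or front (frequencies in Hz)."""
    if f3 < low_f3_hz:  # hypothetical cutoff for a low third formant
        return "retroflexed"
    return "back" if f2_closer_to_f1(f1, f2, f3) else "front"


def classify_u(f1, f2, f3):
    """Label /u/ as back or fronted with the same F2 proximity rule."""
    return "back" if f2_closer_to_f1(f1, f2, f3) else "fronted"


# Illustrative formant values only, not measurements from the database.
examples = [
    classify_schwa(500, 1100, 2500),
    classify_schwa(400, 1900, 2600),
    classify_schwa(500, 1400, 1700),
    classify_u(300, 900, 2300),
    classify_u(300, 1700, 2300),
]
```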

At present, we make no attempt to provide further sub-phonemic characterizations for
vowels other than this front/back distinction for /u/ and the four schwas. For instance,
many vowels are nasalized when they are followed by a nasal, or lateralized when followed
by an /l/. Such information would surely be useful, but the decision-making process is prone
to judgment error, and would require a significant increase in time and effort.

2.4  Others 

We make a distinction between two types of /h/: voiced ([ɦ]) and unvoiced ([h]). The
decision is based mainly on an examination of the waveform for clear low-frequency
periodicity, and of the spectrogram for voicing striations. The voiced form is most common
between two vowels.

Our label set includes a category “epenthetic silence,” 0, which we use to mark
acoustically distinct regions of weak energy separating sounds that involve a change in voicing.
These short gaps are typically due to articulatory timing errors. The most common
occurrences of such gaps are between an /s/ and a semivowel or nasal, as in “small,” “swift,” or
“prince.” Two other non-phonetic symbols are included: # is used to mark regions preceding
and following a sentence, and □ is used to mark pauses within a sentence.

3  CRITERIA  FOR  BOUNDARY  ASSIGNMENTS 

The  acoustic-phonetic  transcription  for  the  TIMIT  sentences  is  time  aligned  with  the 
speech  waveform.  The  alignment  is  useful  in  that  specific  acoustic  events  can  be  accessed 
conveniently based on the transcription. We must stress, however, that the aligned
transcription is intended to establish a correspondence between the transcription and important
acoustic landmarks. One should not directly interpret the region between two time markers
as a distinct phonetic unit, since the encoding of phonetic information in the speech signal
is extremely complicated.

In  most  cases,  the  boundaries  between  two  acoustic-phonetic  events  are  clear  and  well- 
defined,  such  as  that  between  a  stop  closure  and  its  release.  However,  there  are  a  number 
of  cases  where  the  exact  placement  of  a  boundary  is  problematic  (as  is  the  case  between  a 
semivowel  and  a  vowel),  or  cases  where  it’s  not  clear  whether  a  region  should  be  represented 
as  one  or  two  acoustic-phonetic  units  (as  is  the  case  for  diphthongs).  In  these  cases,  we  tried 
to  define  a  set  of  criteria  that  would  be  systematic  and  least  subject  to  human  error,  in  order 
to  produce  boundary  positionings  that  were  as  consistent  as  possible. 

As  mentioned  previously,  we  decided  that  the  boundary  between  the  closure  interval  and 
the  release  of  a  stop  is  an  important  one  that  should  be  assigned.  It  is  certainly  a  very 
distinct  landmark  in  the  waveform.  Anyone  interested  in  studying  the  burst  characteristics 
of  a  stop  would  then  be  able  to  focus  on  just  that  region  that  includes  only  the  released 
portion.  In  a  strictly  phonemic  representation,  the  closure  and  release  would  be  represented 
as  a  single  unit,  and  therefore  that  critical  boundary  would  remain  unmarked. 


A problematic boundary is one that separates a prevocalic stop from a following semivowel,
as in “truck.” Typically, part of the /r/ is devoiced, and therefore is absorbed into the
aspiration portion of the stop. If listening were the only criterion, then the left boundary
of the /r/ would occur somewhere in the aspiration, and the right boundary would occur
somewhere after voicing onset. A clear acoustic boundary at the point of voice onset would
remain unmarked. It would also be difficult to decide where to mark the boundary between
the stop burst and the aspirated /r/ portion. Since voice-onset time (VOT) is a parameter
that has been a focus of many research efforts, it seems unsatisfactory not to include
a reliable mechanism for measuring VOT based on the labelled boundaries. Therefore, we
adopted the policy of always absorbing into the stop release all of the unvoiced portion of a
following vowel or semivowel.
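
Under this policy, the VOT of a prevocalic stop is simply the duration of its labelled release. A minimal sketch follows, assuming the aligned transcription is available as (start, end, label) tuples in seconds. The ASCII label names, the set of voiced labels, and the boundary times are hypothetical and serve only to illustrate the measurement.

```python
RELEASES = {"b", "d", "g", "p", "t", "k"}


def voice_onset_times(segments, voiced_labels):
    """Return (stop_label, vot_seconds) for each released stop before a voiced sound.

    segments is a time-ordered list of (start_s, end_s, label). A stop is a
    closure label such as "tcl" followed by the matching release label "t".
    """
    vots = []
    for closure, release, following in zip(segments, segments[1:], segments[2:]):
        is_stop = release[2] in RELEASES and closure[2] == release[2] + "cl"
        if is_stop and following[2] in voiced_labels:
            start, end, label = release
            # The release spans burst onset to voicing onset, so its length is the VOT.
            vots.append((label, round(end - start, 6)))
    return vots


# Made-up boundaries for "truck"; the devoiced part of /r/ lies inside the /t/ release.
truck = [
    (0.00, 0.05, "tcl"), (0.05, 0.12, "t"), (0.12, 0.18, "r"),
    (0.18, 0.26, "ah"), (0.26, 0.33, "kcl"), (0.33, 0.36, "k"),
]
vots = voice_onset_times(truck, voiced_labels={"r", "ah"})
```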

The  boundary  between  many  semivowels  and  their  adjacent  vowels  is  rather  ill-defined 
in  the  waveform  and  spectrogram,  because  transitions  are  slow  and  continuous.  It  is  not 
possible to define a single point in time that separates the vowel from the semivowel. In
such  cases,  we  decided  to  adopt  a  simple  heuristic  rule,  in  which  one-third  of  the  vocalic 
region  is  assigned  to  the  semivowel,  thus  giving  the  vowel  twice  the  duration  of  the  adjacent 
semivowel.  Previous  investigators  have  also  made  use  of  such  consistent  rules  for  defining 
acoustically  ambiguous  boundaries  [4]. 
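
The one-third rule amounts to a single line of arithmetic. The sketch below places the boundary inside a vocalic region whose end points are already known; the function name and the use of milliseconds are illustrative choices, not part of the labelling software.

```python
def semivowel_vowel_boundary(region_start_ms, region_end_ms, semivowel_first=True):
    """Place the boundary so the semivowel gets one-third of the vocalic region."""
    duration = region_end_ms - region_start_ms
    if duration <= 0:
        raise ValueError("region must have positive duration")
    if semivowel_first:
        # Prevocalic semivowel: it occupies the first third of the region.
        return region_start_ms + duration / 3.0
    # Postvocalic semivowel: it occupies the last third of the region.
    return region_end_ms - duration / 3.0


prevocalic = semivowel_vowel_boundary(1200, 1500)
postvocalic = semivowel_vowel_boundary(1200, 1500, semivowel_first=False)
```

For a 300 ms region, the semivowel receives 100 ms and the vowel 200 ms, which is twice the semivowel's duration, as stated above.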

One obscure condition is a /ts/ or /dz/ sequence, where typically there is little or no
spectral change to characterize a boundary between the homorganic stop and fricative, yet
the onset of acoustic energy of the unit is sufficiently abrupt that a /t/ is heard. Our
convention here is that, if a clear /t/ is heard, the early portion of the /s/ is marked as a
/t/ release.

When  gemination  occurs,  we  do  not  attempt  to  mark  a  boundary  between  the  two  units. 
This situation occurs exclusively at word boundaries, as in “some money.” Furthermore, in
the  case  of  a  stop-stop  sequence  where  the  first  stop  is  unreleased,  the  closure  interval  is 
assigned  to  the  first  stop  and  the  release  to  the  second  one. 

4  PROCEDURE  FOR  TRANSCRIPTION  AND  ALIGNMENT 

The  transcription  and  alignment  process  involves  three  stages: 

1.  An  acoustic-phonetic  sequence  is  entered  manually  by  a  transcriber  as  a  string. 

2.  The  speech  waveform  is  aligned  automatically  with  the  acoustic-phonetic  sequence, 
using  an  alignment  program  developed  at  MIT. 

3. The boundaries generated automatically are then hand corrected by experienced
acoustic phoneticians.

4.1  Transcription 

In  both  stages  1  and  3,  the  labeller  makes  her/his  acoustic-phonetic  decision  based  on 
careful  listening  of  portions  of  the  speech  waveform,  as  well  as  visual  examination  using 
displays  such  as  the  spectrogram  and  the  original  waveform.  The  process  takes  place  within 
the  SPIRE  software  facility  for  speech  analysis,  a  powerful  interactive  tool  that  is  well- 
matched  to  this  task  [2].  Stage  1  requires  less  intensive  use  of  SPIRE  than  stage  3,  because 
it  is  only  necessary  to  record  what  was  heard,  without  identifying  the  time  locations  of  the 
events. Furthermore, minor errors of judgment made at this stage can be readily corrected
in stage 3. The labels can be entered either by typing or by mousing a displayed set. Figure
1 shows the SPIRE layout used for entering the transcription. The completed transcription
is shown in the top window of this display.

Figure 1: SPIRE layout for entering the acoustic-phonetic transcription

In general, we try to label what we hear/see rather than what we expect. Thus, if a person
says “imput” for “input,” the nasal will be marked as an /m/. However, in conditions of
ambiguity, the underlying phonemic form is selected preferentially.

4.2  Automatic  Alignment 

The  alignment  of  a  phonetic  transcription  with  the  corresponding  speech  waveform  is 
essential for making use of the database in speech research, since time-aligned phonetic
transcriptions provide direct access to specific phonetic events in the waveform. Traditionally,
this alignment is done manually by a trained acoustic-phonetician. This is an extremely
time-consuming procedure, requiring the expertise of one or a very small number of
people. Therefore, the amount of data that can be labeled is limited. In addition, manual
labeling often involves decisions which are highly subjective, and thus the results can vary
substantially from one person to another.

Transcription  alignment  of  the  TIMIT  database  makes  use  of  CASPAR,  an  automatic 
alignment system developed at MIT. Descriptions of preliminary implementations of the
system can be found elsewhere [5,6]. Basically, the alignment is accomplished by the system
in three steps. First, each 5 ms frame of the speech data is assigned to one of five broad-class
labels: sonorant, obstruent, voiced-consonant, nasal/voicebar, and silence, using a
non-parametric pattern classifier. The assignment process makes use of a binary decision tree,
based  on  a  set  of  acoustically  motivated  features.  Each  sequence  of  identically-labelled 
frames  is  then  collapsed  into  a  segment  of  the  same  label,  thus  establishing  a  broad-class 
segmentation  of  the  speech.  The  output  of  the  initial  classifier  is  then  aligned  with  the 
phonetic  transcription  using  a  search  strategy  with  some  look-ahead  capability,  guided  by  a 
few  acoustic  phonetic  rules.  For  those  segments  which  correspond  to  two  or  more  phonetic 
events  after  preliminary  alignment,  further  segmentation  is  done  using  specific  algorithms 
based  on  knowledge  of  the  phonetic  context.  In  some  cases  heuristic  rules  are  invoked  (as 
between  a  vowel  and  a  semivowel)  to  assign  consistent,  but  somewhat  arbitrary  boundaries. 
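
The step in which runs of identically labelled frames are collapsed into segments can be sketched in a few lines. This is an illustration only, not the CASPAR implementation: the function names are hypothetical, and the per-frame labels are assumed to have been produced already by the classifier.

```python
from itertools import groupby

FRAME_MS = 5  # frame duration stated in the text


def collapse_frames(frame_labels):
    """Merge runs of identical frame labels into (start_frame, end_frame, label).

    end_frame is exclusive, so consecutive segments share a boundary index.
    """
    segments = []
    index = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((index, index + length, label))
        index += length
    return segments


def to_milliseconds(segments, frame_ms=FRAME_MS):
    """Convert frame indices to times in milliseconds."""
    return [(start * frame_ms, end * frame_ms, label) for start, end, label in segments]


frames = ["silence"] * 3 + ["obstruent"] * 2 + ["sonorant"] * 4
broad_segments = collapse_frames(frames)
timed_segments = to_milliseconds(broad_segments)
```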

Over  the  past  two  years,  two  major  modifications  of  CASPAR  have  taken  place.  First, 
the  alignment  of  the  broad-class  acoustic  labels  with  the  phonetic  symbols  has  been  cast  into 
a  probabilistic  framework.  By  using  a  large  body  of  training  data,  a  set  of  robust,  context- 
dependent  and  durational  statistics  were  obtained.  Second,  a  fourth  module  has  been  added 
to  the  system  to  improve  the  resolution  of  the  boundaries.  This  module  computes  appropriate 
acoustic  attributes  at  a  high  analysis  rate  using  different  window  shapes  that  depend  on  the 
specific  context.  The  boundaries  are  then  adjusted  based  on  these  attributes. 

In  a  formal  evaluation,  it  was  found  that  CASPAR  can  correctly  perform  over  95%  of  the 
labeling  task  previously  done  by  human  transcribers.  The  boundary  locations  produced  by 
the  system  agree  well  with  those  produced  by  human  transcribers.  For  example,  over  75% 
of  the  automatically  generated  boundaries  were  within  10  msec  of  a  boundary  entered  by  a 
trained  phonetician. 

Figure 2 displays the output for the sentence, “She had your dark suit in greasy wash
water all year.” The transcription and boundaries are overlaid on the spectrogram for ease
of examination. For this example, most of the boundaries have been found correctly by
CASPAR. Note, however, that boundaries are missing in the [iɦæ] sequence of “She had.”
The waveform displays the word “dark” and the [s] of “suit.” Note that the initial boundary
of the first [d] is slightly too far forward in time.

4.3  Post-Processing 

The  final  step  is  to  correct  by  hand  any  errors  in  the  automatically  aligned  acoustic- 
phonetic  sequence.  Some  of  the  errors  are  due  to  the  fact  that  CASPAR  is  not  able  to 
determine  certain  boundaries,  such  as  some  of  those  between  two  vowels.  In  other  cases  the 
boundaries  may  have  been  misplaced. 

Hand  correction  of  the  aligned  transcription  is  based  on  critical  listening  of  portions  of 
the  utterance  as  well  as  visual  examination  of  the  spectrogram  and  the  waveform.  The 
spectrogram  covers  close  to  3  seconds  worth  of  speech  at  one  time,  whereas  the  waveform  is 
displayed  on  a  much  more  expanded  time  scale.  For  example,  to  accurately  mark  the  onset 
of  the  release  of  a  stop,  the  cursor  is  first  positioned  on  the  spectrogram  at  the  approximate 
point  in  time.  The  waveform  display  automatically  moves  to  synchronize  in  time  with  the 
cursor,  and  a  fine-tuning  of  the  boundary  can  be  achieved  by  mousing  the  exact  time  point 
in  the  waveform. 

The  mouse  can  be  used  with  ease  to  move  an  existing  boundary  to  a  new  point  in  time, 
to  erase  a  boundary,  or  to  insert  a  boundary.  Furthermore,  a  specified  mouse  click  on 
any  segment  allows  the  labeller  to  change  the  acoustic-phonetic  label  associated  with  that 
segment.  This  step  is  sometimes  necessary  to  correct  an  error  of  judgment  made  in  stage  1. 

An example of the screen layout used for the correction process is shown in Figure 3. The
boundary for the [d] burst onset has been corrected. Missing boundaries were inserted for
the [iɦæ] sequence. In addition, the boundaries associated with the first [w] were extended
on both sides, and an epenthetic silence was inserted between the [š] and the following [w].

Figure 2: SPIRE layout showing the alignment produced by CASPAR

5  CONCLUDING  REMARKS 

Once the acoustic-phonetic transcription has been aligned, it is rather straightforward
to propagate the alignment up to the orthographic transcription as well as the intermediate
phonemic transcription. A time-aligned orthographic transcription is useful when searching
for a specific word, while a time-aligned phonemic transcription can be used to relate the
lexical representation of words to their acoustic realizations. For example, the lexical
representation of the word sequence “gas shortage” contains a word-final /s/ and a word-initial
/š/, whereas its acoustic realization may simply be a long [š]. In this case, the time-aligned
phonemic transcription will map the long [š] to both of the underlying fricatives. Researchers
interested in studying the frequency of occurrence of certain low-level phonological rules will
thus be able to derive the information from these transcriptions.
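
One way to picture this propagation is to attach to each aligned acoustic-phonetic segment the indices of the words it helps realize, and then take the earliest start and latest end for each word. The sketch below makes that assumption explicit. The data layout, the label names, and the times are hypothetical, and the mapping system cited as [7] may work quite differently.

```python
def word_spans(aligned_phones, words):
    """Derive (word, start_ms, end_ms) from phone-level alignments.

    aligned_phones is a list of (start_ms, end_ms, label, word_indices), where
    word_indices lists every word that the segment realizes.
    """
    spans = {}
    for start, end, _label, word_indices in aligned_phones:
        for w in word_indices:
            lo, hi = spans.get(w, (start, end))
            spans[w] = (min(lo, start), max(hi, end))
    return [(words[w], *spans[w]) for w in sorted(spans)]


words = ["gas", "shortage"]
# Made-up alignment in which one long fricative serves both words.
phones = [
    (0, 60, "gcl", [0]), (60, 80, "g", [0]), (80, 200, "ae", [0]),
    (200, 340, "sh", [0, 1]),
    (340, 430, "ao", [1]), (430, 480, "r", [1]), (480, 520, "tcl", [1]),
    (520, 540, "t", [1]), (540, 600, "ix", [1]), (600, 640, "dcl", [1]),
    (640, 720, "jh", [1]),
]
spans = word_spans(phones, words)
```

In this sketch the shared fricative makes the two word spans overlap, which is one possible convention; splitting the shared segment between the words would be another.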

We have developed a system that maps a time-aligned acoustic-phonetic transcription to
the phonemic and orthographic transcriptions [7]. However, the alignment effort for these
transcriptions lags somewhat behind the phonetic alignment. In the interest of expeditiously
making as much data as possible available to the interested parties, we have decided to
provide these other transcriptions in future releases.

Figure 3: SPIRE layout showing the aligned transcription following post-processing

The transcription and alignment of the TIMIT database is a sizable project. At this
writing, all of the sentences have been processed and delivered to the National Bureau of
Standards. A significant portion of the database is now available to the general public via
magnetic tapes, and plans for distributing it by way of compact disc are well under way.
Despite our best intentions to provide as correct a set of transcriptions as possible, however,
errors undoubtedly exist. We urge users of this database to communicate errors to us
whenever possible, so that future users can benefit from this effort.

Finally,  we  would  like  to  thank  Dave  Pallett,  Jim  Hieronymus,  and  their  colleagues  at  NBS 
for the cooperation, patience, and good humor that they provided. Their help, particularly
regarding data transfer, verification, distribution, and fending off eager inquiries, has been
indispensable to this project.

The  development  of  the  TIMIT  database  at  MIT  was  supported  by  the  DARPA-ISTO 
under  contract  N00039-85-C-0341,  as  monitored  by  the  Naval  Space  and  Warfare  Systems 
Command. Major participants in the project at MIT include Corine Bickley, Katy Isaacs,
Rob  Kassel,  Lori  Lamel,  Hong  Leung,  Stephanie  Seneff,  Lydia  Volaitis,  and  Victor  Zue. 